syslog doesn't have anything; these are the last few lines:
Tue, Apr 30
Mon, Apr 29
I see 2 different concerns in this task
Plan LGTM (with the caveat that Janis mentioned regarding using base.name.release). Happy to review patches!
Fri, Apr 26
Cluster and listener envoy snippets, specific to search, are at P61253.
Moving to serviceops-radar, this seems to be more related to the elasticsearch infra than wikikube or appservers.
I've just had to revert the last action from this task, namely the pooling of elastic110[3-7]. More info in T363521.
This https://sal.toolforge.org/log/TXIJEo8BGiVuUzOdIZbf lines up perfectly with the beginning of the errors, so I just reverted it.
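For reference, a revert of that kind is normally a depool via conftool; a minimal sketch, assuming the usual confctl selector syntax (host name taken from the range above, repeat per host):

  $ # Depool one of the freshly pooled hosts
  $ sudo confctl select 'name=elastic1103.eqiad.wmnet' set/pooled=no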
This is pretty concerning. What we also see is "unknown: Status code 503; upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: delayed connect error: 113", which usually means that the destination is unreachable, aka EHOSTUNREACH.
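errno(1) from Debian's moreutils package confirms the mapping of that error code:

  $ errno 113
  EHOSTUNREACH 113 No route to host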
In T363521#9746443, @EBernhardson wrote: Looking at the Host overview dashboard for mwmaint1002 for today, one can see that there were intermittent network errors from 03:00 until 06:50.
Thu, Apr 25
In T363399#9745051, @MoritzMuehlenhoff wrote: Will parsoidtest1001 be installed with Bullseye? scandium is currently running buster, but all the mediawiki manifests are compatible with bullseye (cloudweb already runs it), and so is the component/php74.
Adding @subbu for their information.
Wed, Apr 24
Since mesh.configuration 1.7, envoy on WikiKube and other kubernetes clusters listens on IPv6 and IPv4 for both the TLS terminator and the service mesh listeners. Charts are slowly being updated. On the kubernetes side, once all charts are updated, we'll be done.
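A quick way to verify this on a node is to look at envoy's listening sockets; a sketch, with a hypothetical listener port:

  $ # Expect both an IPv4 (0.0.0.0) and an IPv6 ([::]) socket per listener
  $ sudo ss -tlnp | grep envoy
  LISTEN 0 4096 0.0.0.0:4041 0.0.0.0:* users:(("envoy",...))
  LISTEN 0 4096    [::]:4041    [::]:* users:(("envoy",...))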
I've also just run kubelet 1.23 in standalone mode talking to containerd, and indeed processes in containers run with the cri-containerd.apparmor.d apparmor profile.
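The check itself is straightforward; a sketch with a hypothetical container PID:

  $ # The AppArmor label of a running process is exposed in /proc
  $ cat /proc/12345/attr/current
  cri-containerd.apparmor.d (enforce)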
Adding for bookworm
Adding as info since it was requested in T362408#9712356
In T362408#9712356, @JMeybohm wrote: @akosiaris could you please double check in your test environment that containerd will still enforce the default apparmor profile (see "Remove apparmor.security.beta.kubernetes.io/defaultProfileName" in T273507: PodSecurityPolicies will be deprecated with Kubernetes 1.21) like docker currently does?
Tue, Apr 23
I am resolving, hopefully we won't see a recurrence.
I've just uncordoned it; it should receive mediawiki payloads in the next deployment. I've also checked, and it's again a scap target for the kubernetes-workers group.
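For the record, the uncordon step itself is just (node name hypothetical):

  $ kubectl uncordon kubernetes1005.eqiad.wmnet
  node/kubernetes1005.eqiad.wmnet uncordoned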
In T363086#9735726, @Jclark-ctr wrote: @akosiaris @hashar I reset the iDRAC with no change; I will need to reboot the server and hook a crash cart up to it. Please advise if I am able to reboot.
In T362681#9729331, @MoritzMuehlenhoff wrote: That's not a problem. We should just use the nodesource packages for this; we've done the same before for "intermediate LTSes" (e.g. node 16 or node 14) not covered by an in-tree Debian nodejs version. I'll work on this next week.
In T363086#9734314, @hashar wrote: parse1002.eqiad.wmnet is down/unreachable but is still in the pool of hosts of the deploy tool. That has caused the MediaWiki train to fail overnight and is causing every MediaWiki deployment to error out due to a timeout when trying to reach that host.
Can someone please remove the host from the pool of MediaWiki target hosts? Thanks!
Thu, Apr 18
nodejs20 isn't even in trixie/sid right now (https://packages.debian.org/trixie/nodejs, https://packages.debian.org/sid/nodejs), only in experimental.
Commenting here as well at the request of @Ottomata in T249745#9725953
In T249745#9723704, @Ottomata wrote:
Wed, Apr 17
In T249745#9722942, @Ottomata wrote: see the CAP theorem
C != eventual-C: the C in CAP is linearizability, a much stronger guarantee than eventual consistency. Eventual consistency + AP is feasible and done often.
- We have staging base images and staging service images updated daily based on what is in Debian and apt.wikimedia.org
Mon, Apr 15
The immediate issue blocking the train has been resolved and new images have been pushed; hence, lowering to High. There's still a tail of images being rebuilt and it's going to take a while longer, but this is no longer a UBN.
Fri, Apr 12
Thanks for tackling this!
Thu, Apr 11
I don't think SRE has ever administered Google Postmaster Tools at all. In fact, a quick cross-check in the team shows almost utter ignorance of the product, although we'll ask around internally a bit more. May I suggest reaching out to ITS too?
Apr 9 2024
In T360636#9700038, @MoritzMuehlenhoff wrote: In T360636#9698325, @akosiaris wrote: I'll finish parsoid and testreduce in T359387
If I'm not mistaken, testreduce is still unrelated; it's for the round-trip tests that were split off to a separate Ganeti VM some time ago (and were moved to Bookworm due to nodejs requirements last year)?
Apr 8 2024
I'll finish parsoid and testreduce in T359387
In T249745#9640217, @Ottomata wrote: This doesn't mean that MediaWiki shouldn't try to improve the situation by handling the failure to submit a job, e.g. by saving it somewhere (a specific db table?) so we can replay them later. At the current failure rate, this would guarantee the jobs get executed, at a negligible cost in terms of resources.
@Joe this sounds sort of similar to the Outbox solution described in T120242, albeit only for failed submissions instead of all of them. Functionally this sounds like a nice solution to the eventual consistency problem described there, but I'd expect it would add some latency to the user response (waiting for ACK from EventGate+Kafka). Actually it sounds more like this (discarded?) solution, except:
Apr 4 2024
In T361483, I've been poking into selectively killing parts of changeprop that are no longer used. I am still in the "hopefully easy pickings" phase, attacking things we KNOW aren't used any more. I am now targeting the removal of the functionality in changeprop that refreshes all the mobile-sections parts of RESTBase, meaning RESTBase will no longer have up-to-date content for these endpoints.
In T361483#9680093, @elukey wrote: In T361483#9680024, @akosiaris wrote: In T361483#9679703, @elukey wrote: Hello!
The ores_cache job should be defined but disabled in the running config; we don't use it anymore and IIRC it is not running anymore in ChangeProp (lemme know otherwise).
You are correct. I'll post a patch then to remove it. Thanks!
For Lift Wing, we just use CP to call inference.discovery.wmnet, no restbase involved. The idea is to create streams like "for every rev-id, get a score from a Lift Wing model".
This is probably something we want to move out of Changeprop then and into the jobqueue (same software, I know, but a different installation). Looking at the config, I think that there is no code that is specific to Lift Wing, just standard reaction to events on Kafka.
No problem for me! I can only see one issue, and this is not specific to our topics: if we start another job in cp-jobqueue, the kafka consumer offset will be reset to whatever is the last element in the topic, and we'll potentially lose events in the stream. It is not a huge deal since at the moment nothing incredibly critical relies on them, but IIRC Search uses one of the running topics to update Elasticsearch. If we move everything over we'd need to sync with them and figure out if a "hole" in the stream is acceptable; otherwise the only thing that I can think of is the following (see the sketch after this list):
- stop the changeprop rule for the lift wing topic that Search uses.
- write down the offset of the related consumer group using the kafka api (IIRC it should be possible)
- create another consumer group in cp-jobqueue with the same initial offset (this is not super difficult but I have never done it).
- add the rule to cp-jobqueue and check if it works.
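A minimal sketch of steps 2-4 with the stock Kafka CLI (broker, topic and group names are hypothetical; --reset-offsets can also seed offsets for a group that has no commits yet, but verify that on a test topic first):

  $ # 2. Record the current offsets of the existing changeprop consumer group
  $ kafka-consumer-groups.sh --bootstrap-server kafka-main1001:9092 \
      --group changeprop-liftwing --describe

  $ # 3.-4. Seed the new cp-jobqueue group at the recorded offset
  $ kafka-consumer-groups.sh --bootstrap-server kafka-main1001:9092 \
      --group cpjobqueue-liftwing --topic eqiad.mediawiki.revision-score \
      --reset-offsets --to-offset 123456 --execute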
I guess it's about time I ask if it is ok to remove those exceptions now and return 403 to everyone for these endpoints.
Moving it to our radar too as we intend to revisit various parts of all of this (e.g. how we do MultiVersion once we are no longer constrained by the legacy infra), but we don't have something concrete right now.
Apr 2 2024
In T309772#9680594, @Mvolz wrote: The last remaining original Services member left in 2022.
- Use qemu to run x86_64 containers on an aarch64 VM
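A sketch of that option with user-mode emulation (Debian package names; the image is just an example):

  $ # qemu-user-static registers binfmt_misc handlers for foreign binaries
  $ sudo apt install qemu-user-static binfmt-support
  $ docker run --rm --platform linux/amd64 hello-world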
In T361483#9679703, @elukey wrote: Hello!
The ores_cache job should be defined but disabled in the running config; we don't use it anymore and IIRC it is not running anymore in ChangeProp (lemme know otherwise).
Apr 1 2024
Let's start with the "easy" ones. I see feature flags in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/changeprop/templates/_config.yaml for
LWN has an article titled "The race to replace Redis". I am not going to link it directly as it is LWN subscriber-only content, but I can summarize (note: I am pasting links in their entirety on purpose) the "Forks and alternatives" section:
Mar 22 2024
I've already left various comments on the 2 docs. I am still going through the Miro board, but I can summarize the following:
Hi @MShilova_WMF. This is on my list for today, though it might spill into early next week. I've started the review, but I don't seem to have access to T358115 (linked from the description); could you please grant me access?
Mar 21 2024
In T359067#9627299, @elukey wrote: @akosiaris thanks a lot for all the details, really appreciated, now I have a better understanding of the problem :)
I have a proposal to unblock my team, let me know what you think about it. On the ML side, we are doing the following:
- Try to reduce pytorch's size, figuring out if we can drop something (for example, support fewer GPUs, etc.). We are logging work in T359569, but I am not super confident that we'll be able to get a significant reduction without coming up with a very complicated and difficult-to-maintain custom build process (like a custom Python wheel stored somewhere, long build times to recreate pytorch when needed in CI, etc).
Alerts are gone; I'll resolve this.
So, since I've never done this before (that I remember), please double-check me on this. Is it just enough to issue
In T360598#9648509, @brouberol wrote:

brouberol@kafka-main2001:~$ echo y | openssl s_client -connect $(hostname -f):9093 | openssl x509 -issuer -nout
x509: Unrecognized flag nout
x509: Use -help for summary.
depth=2 C = US, ST = California, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = Cloud Services, CN = Wikimedia_Internal_Root_CA
verify return:1
depth=1 C = US, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = SRE Foundations, CN = kafka
verify return:1
depth=0 CN = kafka-main2001.codfw.wmnet
verify return:1
DONE

The runbook mentions:
If the CA mentioned is:
- the Puppet one, then you'll need to follow Cergen#Update_a_certificate and deploy the new certificate to all nodes.
- the Kafka PKI Intermediate one, then in theory a new certificate should be issued a few days before the expiry and puppet should replace the Kafka keystore automatically (under /etc/kafka/ssl).
@akosiaris do you happen to know which one it is in that case? It's not obvious to me. Thanks!
I'd tend to say the Kafka PKI Intermediate one, due to depth=1 CN = kafka, but a confirmation would be perfect.
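As an aside, the flag in the pasted command is misspelled; the intended flag is -noout, which, given the chain above, should print the depth=1 issuer:

  $ echo | openssl s_client -connect $(hostname -f):9093 2>/dev/null | openssl x509 -issuer -noout
  issuer=C = US, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = SRE Foundations, CN = kafka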
Adding @brouberol as they probably have way more experience with refreshing kafka certificates than anyone in serviceops.
Related alerts in alerts.wikimedia.org have been silenced for 30 days (chosen arbitrarily) with a comment pointing to this task.
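For reference, such a silence can also be created from the CLI with the upstream Alertmanager tool (alert matcher and task number are hypothetical):

  $ amtool silence add alertname="CertAlmostExpired" \
      --duration 30d --comment "Silenced pending T123456" \
      --alertmanager.url http://localhost:9093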
Mar 20 2024
In T358738#9644283, @tstarling wrote: I thought there was no cross-DC replication of thumbnails. T299125#8221206 seems to support that. So it's expected that a bad file created by T344233 would only affect one swift DC.
Mar 19 2024
We had to repool kartotherian in codfw as we had a CPU exhaustion event in eqiad right after the services switchover. Since some kartotherian endpoints create an amplification effect on kartotherian itself, we opted for restarting kartotherian in eqiad to fix that.
In T358738#9639319, @TheDJ wrote: ping @akosiaris. Ideas on why codfw is out of date and won't correct? Is it out of rotation or something?
I wanted to point out that as the migration progresses and the size of MediaWiki deployments in WikiKube increases, it is inevitable that the deployment times for MW-on-K8s will increase too. Right now, we upgrade to each new version in chunks of 3% (16d6e717a7a) of the total. This is a relatively recent development; in the past we upgraded in larger chunks, since the overall size of each deployment was smaller. I expect those numbers to increase more, but I also expect the numbers for scap deploying to "legacy" infrastructure to decrease. Not proportionally, of course.
Mar 17 2024
Mar 10 2024
In T323169#9616553, @taavi wrote: Is the HTTP response body for those 403s saved anywhere?
Mar 8 2024
Zotero uses url-downloader to access the internet. Its logs end up in logstash, e.g.
In T359067#9606960, @elukey wrote:
@JMeybohm is there anything left to do here? I think we can resolve.
Mar 6 2024
Almost all parsoid hosts have been reimaged as kubernetes nodes, with scandium, testreduce1002, parse1001 and parse1002 being the exceptions: the former two because it was requested in T357392#9546852, the latter two because we don't want to mess with the state of parsoid-php right before the SRE summit and DC switchover. I'll reword this task a bit, then resolve it and file a follow-up cleanup task for the last two nodes to reimage and the related cleanups.
Mar 5 2024
So this is a difficult one to tackle. From what I gather, images (and layers) can end up being really large, close to 10GB. I have questions regarding how a pip install ends up consuming 10GB of disk space, of course, but the main issue is probably that this is going to cause issues down the road anyway. So it is probably unsustainable long term.
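One way to see which layer balloons an image; a sketch with a hypothetical image name:

  $ # The SIZE column attributes disk usage to each Dockerfile step,
  $ # which usually points straight at the offending pip install layer
  $ docker history --no-trunc docker-registry.wikimedia.org/ml-image:latest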
We are at ~50% for mw-parsoid right now.
I've added another 220 CPUs for codfw and 300 for eqiad; we should be good on this front. I'll resolve in the interest of sparing someone else from doing so; feel free to reopen.
I've accounted for the cordoned nodes and indeed...