syslog doesn't have anything; these are the last few lines:
Tue, Apr 30
Mon, Apr 29
I see 2 different concerns in this task
Plan LGTM (with the caveat that Janis mentioned regarding using base.name.release). Happy to review patches!
Fri, Apr 26
Cluster and listener envoy snippets, specific to search, are at P61253.
Moving to serviceops-radar, this seems to be more related to the elasticsearch infra than wikikube or appservers.
I've just had to revert the last action from this task, namely the pooling of elastic110[3-7]. More info in T363521.
This https://sal.toolforge.org/log/TXIJEo8BGiVuUzOdIZbf lines up perfectly with the beginning of the errors, so I just reverted it.
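For reference, a revert of that kind is normally a depool via conftool; a minimal sketch, assuming the usual confctl selector syntax (host name taken from the range above, repeat per host):

  $ # Depool one of the freshly pooled hosts
  $ sudo confctl select 'name=elastic1103.eqiad.wmnet' set/pooled=no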
This is pretty concerning. What we also see is "unknown: Status code 503; upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: delayed connect error: 113", which usually means that the destination is unreachable, aka EHOSTUNREACH.
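errno(1) from Debian's moreutils package confirms the mapping of that error code:

  $ errno 113
  EHOSTUNREACH 113 No route to host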
In T363521#9746443, @EBernhardson wrote: Looking at the Host overview dashboard for mwmaint1002 for today, one can see that there were intermittent network errors from 03:00 until 06:50.
Thu, Apr 25
In T363399#9745051, @MoritzMuehlenhoff wrote: Will parsoidtest1001 be installed with Bullseye? scandium is currently running buster, but all the mediawiki manifests are compatible with bullseye (cloudweb already runs it), and so is the component/php74.
Adding @subbu for their information.
Wed, Apr 24
Since mesh.configuration 1.7, envoy on WikiKube and other kubernetes clusters listens on IPv6 and IPv4 for both the TLS terminator and the service mesh listeners. Charts are slowly being updated. On the kubernetes side, once all charts are updated, we'll be done.
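A quick way to verify this on a node is to look at envoy's listening sockets; a sketch, with a hypothetical listener port:

  $ # Expect both an IPv4 (0.0.0.0) and an IPv6 ([::]) socket per listener
  $ sudo ss -tlnp | grep envoy
  LISTEN 0 4096 0.0.0.0:4041 0.0.0.0:* users:(("envoy",...))
  LISTEN 0 4096    [::]:4041    [::]:* users:(("envoy",...))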
I've also just run kubelet 1.23 in standalone mode talking to containerd, and indeed processes in containers run with the cri-containerd.apparmor.d apparmor profile.
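The check itself is straightforward; a sketch with a hypothetical container PID:

  $ # The AppArmor label of a running process is exposed in /proc
  $ cat /proc/12345/attr/current
  cri-containerd.apparmor.d (enforce)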
Adding for bookworm
Adding as info since it was requested in T362408#9712356
In T362408#9712356, @JMeybohm wrote: @akosiaris could you please double check in your test environment that containerd will still enforce the default apparmor profile (see "Remove apparmor.security.beta.kubernetes.io/defaultProfileName" in T273507: PodSecurityPolicies will be deprecated with Kubernetes 1.21) like docker currently does?
Tue, Apr 23
I am resolving, hopefully we won't see a recurrence.
I've just uncordoned it; it should receive mediawiki payloads in the next deployment. I've also checked, and it's again a scap target for the kubernetes-workers group.
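For the record, the uncordon step itself is just (node name hypothetical):

  $ kubectl uncordon kubernetes1005.eqiad.wmnet
  node/kubernetes1005.eqiad.wmnet uncordoned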
In T363086#9735726, @Jclark-ctr wrote: @akosiaris @hashar I reset the iDRAC with no change; I will need to reboot the server and hook a crash cart up to it. Please advise if I am able to reboot.
In T362681#9729331, @MoritzMuehlenhoff wrote: That's not a problem. We should just use the nodesource packages for this; we've done the same before for "intermediate LTSes" (e.g. node 16 or node 14) not covered by an in-tree Debian nodejs version. I'll work on this next week.
In T363086#9734314, @hashar wrote: parse1002.eqiad.wmnet is down/unreachable but is still in the pool of hosts of the deploy tool. That has caused the MediaWiki train to fail overnight and is causing every MediaWiki deployment to error out due to a timeout when trying to reach that host.
Can someone please remove the host from the pool of MediaWiki target hosts? Thanks!
Thu, Apr 18
nodejs20 isn't even in trixie/sid right now (https://packages.debian.org/trixie/nodejs, https://packages.debian.org/sid/nodejs), only in experimental.
Commenting here as well at the request of @Ottomata in T249745#9725953
In T249745#9723704, @Ottomata wrote:
Wed, Apr 17
In T249745#9722942, @Ottomata wrote: see the CAP theorem
C != eventual-C: the C in CAP is linearizability, a much stronger guarantee than eventual consistency. Eventual consistency + AP is feasible and done often.
- We have staging base images and staging service images updated daily based on what is in Debian and apt.wikimedia.org
Mon, Apr 15
The immediate issue blocking the train has been resolved and new images have been pushed; hence, lowering to High. There's still a tail of images being rebuilt and it's going to take a while longer, but this is no longer a UBN.
Fri, Apr 12
Thanks for tackling this!
Thu, Apr 11
I don't think SRE has ever administered Google Postmaster Tools at all. In fact, a quick cross-check in the team shows almost utter ignorance of the product, although we'll ask around internally a bit more. May I suggest reaching out to ITS too?
Apr 9 2024
In T360636#9700038, @MoritzMuehlenhoff wrote: In T360636#9698325, @akosiaris wrote: I'll finish parsoid and testreduce in T359387
If I'm not mistaken, testreduce is still unrelated; it's for the round-trip tests that were split off to a separate Ganeti VM some time ago (and were moved to Bookworm due to nodejs requirements last year)?
Apr 8 2024
I'll finish parsoid and testreduce in T359387
In T249745#9640217, @Ottomata wrote: This doesn't mean that MediaWiki shouldn't try to improve the situation by handling the failure to submit a job, e.g. by saving it somewhere (a specific db table?) so we can replay them later. At the current failure rate, this would guarantee the jobs get executed, at a negligible cost in terms of resources.
@Joe this sounds sort of similar to the Outbox solution described in T120242, albeit only for failed submissions instead of all of them. Functionally this sounds like a nice solution to the eventual consistency problem described there, but I'd expect it would add some latency to the user response (waiting for ACK from EventGate+Kafka). Actually it sounds more like this (discarded?) solution, except:
Apr 4 2024
In T361483, I've been poking into selectively killing parts of changeprop that are no longer used. I am still in the "hopefully easy pickings" phase, attacking things we KNOW aren't used any more. I am now targeting the removal of the functionality in changeprop that refreshes all the mobile-sections parts of RESTBase, meaning RESTBase will no longer have up-to-date content for these endpoints.
In T361483#9680093, @elukey wrote: In T361483#9680024, @akosiaris wrote: In T361483#9679703, @elukey wrote: Hello!
The ores_cache job should be defined but disabled in the running config; we don't use it anymore and IIRC it is not running anymore in ChangeProp (lemme know otherwise).
You are correct. I'll post a patch then to remove it. Thanks!
For Lift Wing, we just use CP to call inference.discovery.wmnet, no restbase involved. The idea is to create streams like "for every rev-id, get a score from a Lift Wing model".
This is probably something we want to move out of Changeprop then and into the jobqueue (same software, I know, but a different installation). Looking at the config, I think that there is no code that is specific to Lift Wing, just standard reaction to events on Kafka.
No problem for me! I can only see one issue, and this is not specific to our topics: if we start another job in cp-jobqueue, the kafka consumer offset will be reset to whatever is the last element in the topic, and we'll potentially lose events in the stream. It is not a huge deal since at the moment nothing incredibly critical relies on them, but IIRC Search uses one of the running topics to update Elasticsearch. If we move everything over we'd need to sync with them and figure out if a "hole" in the stream is acceptable; otherwise the only thing that I can think of is the following (see the sketch after this list):
- stop the changeprop rule for the lift wing topic that Search uses.
- write down the offset of the related consumer group using the kafka api (IIRC it should be possible)
- create another consumer group in cp-jobqueue with the same initial offset (this is not super difficult but I have never done it).
- add the rule to cp-jobqueue and check if it works.
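A minimal sketch of steps 2-4 with the stock Kafka CLI (broker, topic and group names are hypothetical; --reset-offsets can also seed offsets for a group that has no commits yet, but verify that on a test topic first):

  $ # 2. Record the current offsets of the existing changeprop consumer group
  $ kafka-consumer-groups.sh --bootstrap-server kafka-main1001:9092 \
      --group changeprop-liftwing --describe

  $ # 3.-4. Seed the new cp-jobqueue group at the recorded offset
  $ kafka-consumer-groups.sh --bootstrap-server kafka-main1001:9092 \
      --group cpjobqueue-liftwing --topic eqiad.mediawiki.revision-score \
      --reset-offsets --to-offset 123456 --execute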
I guess it's about time I ask if it is ok to remove those exceptions now and return 403 to everyone for these endpoints.
Moving it to our radar too as we intend to revisit various parts of all of this (e.g. how we do MultiVersion once we are no longer constrained by the legacy infra), but we don't have something concrete right now.
Apr 2 2024
In T309772#9680594, @Mvolz wrote: The last remaining original Services member left in 2022.
- Use qemu to run x86_64 containers on an aarch64 VM
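A sketch of that option with user-mode emulation (Debian package names; the image is just an example):

  $ # qemu-user-static registers binfmt_misc handlers for foreign binaries
  $ sudo apt install qemu-user-static binfmt-support
  $ docker run --rm --platform linux/amd64 hello-world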
In T361483#9679703, @elukey wrote: Hello!
The ores_cache job should be defined but disabled in the running config; we don't use it anymore and IIRC it is not running anymore in ChangeProp (lemme know otherwise).
Apr 1 2024
Let's start with the "easy" ones. I see feature flags in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/changeprop/templates/_config.yaml for
LWN has an article titled "The race to replace Redis". I am not going to link it directly as it is LWN subscriber-only content, but I can summarize (note: I am pasting links in their entirety on purpose) the "Forks and alternatives" section:
Mar 22 2024
I've already left various comments on the 2 docs. I am still going through the Miro board, but I can summarize the following:
Hi @MShilova_WMF. This is on my list for today, though it might spill into early next week. I've started the review, but I don't seem to have access to T358115 (linked from the description); could you please grant me access?
Mar 21 2024
In T359067#9627299, @elukey wrote: @akosiaris thanks a lot for all the details, really appreciated, now I have a better understanding of the problem :)
I have a proposal to unblock my team, let me know what you think about it. On the ML side, we are doing the following:
- Try to reduce pytorch's size, figuring out if we can drop something (for example, support fewer GPUs, etc.). We are logging work in T359569, but I am not super confident that we'll be able to get a significant reduction without coming up with a very complicated and difficult-to-maintain custom build process (like a custom Python wheel stored somewhere, long build times to recreate pytorch when needed in CI, etc).
Alerts are gone; I'll resolve this.
So, since I've never done this before (that I remember), please double-check me on this. Is it just enough to issue
In T360598#9648509, @brouberol wrote:

brouberol@kafka-main2001:~$ echo y | openssl s_client -connect $(hostname -f):9093 | openssl x509 -issuer -nout
x509: Unrecognized flag nout
x509: Use -help for summary.
depth=2 C = US, ST = California, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = Cloud Services, CN = Wikimedia_Internal_Root_CA
verify return:1
depth=1 C = US, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = SRE Foundations, CN = kafka
verify return:1
depth=0 CN = kafka-main2001.codfw.wmnet
verify return:1
DONE

The runbook mentions:
If the CA mentioned is:
- the Puppet one, then you'll need to follow Cergen#Update_a_certificate and deploy the new certificate to all nodes.
- the Kafka PKI Intermediate one, then in theory a new certificate should be issued a few days before the expiry and puppet should replace the Kafka keystore automatically (under /etc/kafka/ssl).
@akosiaris do you happen to know which one it is in that case? It's not obvious to me. Thanks!
I'd tend to say the Kafka PKI Intermediate one, due to depth=1 CN = kafka, but a confirmation would be perfect.
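As an aside, the flag in the pasted command is misspelled; the intended flag is -noout, which, given the chain above, should print the depth=1 issuer:

  $ echo | openssl s_client -connect $(hostname -f):9093 2>/dev/null | openssl x509 -issuer -noout
  issuer=C = US, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = SRE Foundations, CN = kafka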
Adding @brouberol as they probably have way more experience with refreshing kafka certificates than anyone in serviceops.
Related alerts in alerts.wikimedia.org have been silenced for 30 days (chosen arbitrarily) with a comment pointing to this task.
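For reference, such a silence can also be created from the CLI with the upstream Alertmanager tool (alert matcher and task number are hypothetical):

  $ amtool silence add alertname="CertAlmostExpired" \
      --duration 30d --comment "Silenced pending T123456" \
      --alertmanager.url http://localhost:9093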
Mar 20 2024
In T358738#9644283, @tstarling wrote: I thought there was no cross-DC replication of thumbnails. T299125#8221206 seems to support that. So it's expected that a bad file created by T344233 would only affect one swift DC.
Mar 19 2024
We had to repool kartotherian in codfw as we had a CPU exhaustion event in eqiad right after the services switchover. Since some kartotherian endpoints create an amplification effect on kartotherian itself, we opted for restarting kartotherian in eqiad to fix that.
In T358738#9639319, @TheDJ wrote: ping @akosiaris. Ideas on why codfw is out of date and won't correct? Is it out of rotation or something?
I wanted to point out that as the migration progresses and the size of MediaWiki deployments in WikiKube increases, it is inevitable that the deployment times for MW-on-K8s will increase too. Right now, we upgrade to each new version in chunks of 3% (16d6e717a7a) of the total. This is a relatively recent development; in the past we upgraded in larger chunks, since the overall size of each deployment was smaller. I expect those numbers to increase more, but I also expect the numbers for scap deploying to "legacy" infrastructure to decrease. Not proportionally, of course.
Mar 17 2024
Mar 10 2024
In T323169#9616553, @taavi wrote: Is the HTTP response body for those 403s saved anywhere?
Mar 8 2024
Zotero uses url-downloader to access the internet. Its logs end up in logstash, e.g.
In T359067#9606960, @elukey wrote:
@JMeybohm is there anything left to do here? I think we can resolve.
Mar 6 2024
Almost all parsoid hosts have been reimaged as kubernetes nodes, with scandium, testreduce1002, parse1001 and parse1002 being the exceptions: the former two because it was requested in T357392#9546852, the latter two because we don't want to mess with the state of parsoid-php right before the SRE summit and DC switchover. I'll reword this task a bit, then resolve it and file a follow-up cleanup task for the last two nodes to reimage and the related cleanups.
Mar 5 2024
So this is a difficult one to tackle. From what I gather, images (and layers) can end up being really large, close to 10GB. I have questions regarding how a pip install ends up consuming 10GB of disk space, of course, but the main issue is probably that this is going to cause issues down the road anyway. So it is probably unsustainable long term.
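One way to see which layer balloons an image; a sketch with a hypothetical image name:

  $ # The SIZE column attributes disk usage to each Dockerfile step,
  $ # which usually points straight at the offending pip install layer
  $ docker history --no-trunc docker-registry.wikimedia.org/ml-image:latest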
We are at ~50% for mw-parsoid right now.
I've added another 220 CPUs for codfw and 300 for eqiad; we should be good on this front. I'll resolve in the interest of sparing someone else from doing so; feel free to reopen.
I've accounted for the cordoned nodes and indeed...