@dancy it would be great if someone could finish this soon. While scap now does have an option to mitigate potential helm hiccups, I think we should add it to the mix nevertheless
Tue, Apr 16
Mon, Apr 15
Thu, Apr 11
Built and repackaged
Built and uploaded
Tue, Apr 9
In T361724#9690204, @dancy wrote: There is a tmux/screen check for scap stage-train, but nothing else. This could be factored out to cover other scap subcommands.
Suggestions:
- scap backport
- scap deploy
- scap deploy-promote
- scap sync-*
- scap train
- scap stage-train
- scap lock
Applying the tmux/screen check to all scap subcommands (especially those unrelated to deployment) is definitely undesirable; a rough sketch of a shared, opt-in check follows below.
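As a minimal sketch of how the guard could be factored out into a shared helper that individual subcommands opt into (this is not scap's actual code; the helper name and the environment-variable heuristic are assumptions):

```python
import os


def assert_inside_terminal_multiplexer():
    """Abort unless the current shell is running under tmux or screen.

    tmux exports TMUX and GNU screen exports STY, so checking the
    environment is a reasonable best-effort guard against a long
    deployment dying with a dropped SSH session.
    """
    if "TMUX" not in os.environ and "STY" not in os.environ:
        raise SystemExit(
            "Please run this command inside tmux or screen so it "
            "survives an SSH disconnect."
        )
```

Each subcommand that wants the guard would call it at the top of its entry point, leaving it off for subcommands where the check is undesirable.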
Thu, Apr 4
Wed, Apr 3
We depooled mw-web-ro from eqiad and attempted a rollback.
Tue, Apr 2
In T360596#9676049, @akosiaris wrote: My two operationally minded cents say to wait for the dust to settle a little before moving forward with any plan. We have time, and I have my doubts about all 4 different forks surviving.
Thu, Mar 28
Wed, Mar 27
Switchover is done, it is Day 8, and we are back to Multi-DC. Thank you serviceops and @akosiaris for being good teammates and keeping an eye on things.
Tue, Mar 26
Thu, Mar 21
Mar 20 2024
Mar 19 2024
This is done; we'll reopen if something goes south.
Mar 6 2024
Mar 5 2024
The mw-mcrouter DaemonSet has been deployed on staging.
Mar 4 2024
Looks alright!
@Trizek-WMF per our off-Phabricator discussion, the major change is that this is no longer a procedure we are testing; it has become standard practice. Please edit the message as you see fit to reflect that.
Feb 29 2024
Feb 27 2024
Looking into the issue, we found that around Feb 26 at ~21:45 UTC, urldownloader1003 (a Ganeti VM running on ganeti1027, i.e. the cluster master) lost network connectivity.
This was most likely related to T358597.
After issuing a restart, the VM came back to life normally.
Feb 23 2024
Feb 22 2024
Feb 21 2024
Feb 14 2024
Things are progressing well after the last change; please reopen if this resurfaces. Shoutout to @akosiaris for lending a hand.
In T356766#9537892, @kostajh wrote: In T356766#9537874, @jijiki wrote:
- meanwhile, we restarted the envoy proxies, which seems to have significantly improved the issue; 503s are now down to around 2 per hour
Hmm, I am not seeing that reflected here https://logstash.wikimedia.org/goto/d99b2beddeb28d71ddc74e7298c69cc8
Feb 13 2024
- tcpdump shows that upstream sends an RST, but nothing else useful
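For illustration, a capture along these lines isolates the upstream RSTs at the pod level. This is not the exact command used in the investigation; the interface name is an assumption, Python is used only to wrap tcpdump, and tcpdump must be present on the host:

```python
import subprocess

# Show only TCP segments with the RST flag set, arriving on the pod's
# interface. "eth0" is an assumption; substitute the pod's actual
# interface (or the node-side veth/cali* device when capturing there).
subprocess.run(
    ["tcpdump", "-ni", "eth0", "tcp[tcpflags] & tcp-rst != 0"],
    check=True,
)
```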
In T356766#9534550, @kostajh wrote: In T356766#9534518, @jijiki wrote: so far:
- dumped traffic at the pod level (ipoid has only one pod and the service is low traffic), and I never saw a packet from an appserver
- from the pod's perspective, there are no 5xx errors (Grafana: envoy-telemetry-k8s)
- nothing standing out on lvs2013
- nothing odd on the k8s node itself
Is there something more we could log from the MW side that would help debug this? Is it possible there is some special routing happening because the originating request to ipoid happens in a POST request context (so it always originates from the primary DC)?
Feb 12 2024
Feb 8 2024
I set the host as inactive since I noticed a bit of log spam on lvs2013