
akosiaris (Alexandros Kosiaris)
Site Reliability Engineer · Administrator

Projects (22)

User Details

User Since
Oct 3 2014, 8:40 AM (478 w, 4 d)
Roles
Administrator
Availability
Available
IRC Nick
akosiaris
LDAP User
Alexandros Kosiaris
MediaWiki User
AKosiaris (WMF) [ Global Accounts ]

Recent Activity

Yesterday

akosiaris added a comment to T271142: Some Service Operations clusters apparently do not support IPv6.

@akosiaris sure, and having a cluster deemed as *not* IPv6 ready is totally ok.
The problem arises when the cluster is mixed, with some hosts with AAAA records and some without, as it is the case for the above clusters. As per T271142#8061841 and https://wikitech.wikimedia.org/wiki/DNS/Netbox#Mixed_clusters

Mon, Dec 4, 2:19 PM · Patch-For-Review, Infrastructure-Foundations, Dumps-Generation, IPv6, serviceops, SRE-tools
akosiaris added a comment to T271142: Some Service Operations clusters apparently do not support IPv6.

All of these (which can be grouped into just 2 categories, mw and mc) have already been deemed dangerous and out of scope per my T271142#6955077. I'll check with the team, but I doubt we have any intention of devoting work to those.

Mon, Dec 4, 12:52 PM · Patch-For-Review, Infrastructure-Foundations, Dumps-Generation, IPv6, serviceops, SRE-tools

Fri, Dec 1

akosiaris placed T352547: decommission rdb1009, rdb1010 up for grabs.
Fri, Dec 1, 4:03 PM · SRE, ops-eqiad, decommission-hardware
akosiaris added a comment to T271142: Some Service Operations clusters apparently do not support IPv6.

@Volans, since dumpsdata[1001-1003].eqiad.wmnet and snapshot[1005-1010].eqiad.wmnet are no longer with serviceops, I think we can resolve this one?

Fri, Dec 1, 4:02 PM · Patch-For-Review, Infrastructure-Foundations, Dumps-Generation, IPv6, serviceops, SRE-tools
akosiaris updated the task description for T271142: Some Service Operations clusters apparently do not support IPv6.
Fri, Dec 1, 3:58 PM · Patch-For-Review, Infrastructure-Foundations, Dumps-Generation, IPv6, serviceops, SRE-tools
akosiaris created T352547: decommission rdb1009, rdb1010.
Fri, Dec 1, 3:18 PM · SRE, ops-eqiad, decommission-hardware
akosiaris closed T326171: rdb101[34] serviceops implementation tracking as Resolved.

This is now done.

Fri, Dec 1, 3:17 PM · SRE, serviceops
akosiaris closed T326171: rdb101[34] serviceops implementation tracking, a subtask of T326170: Q4:rack/setup/install rdb101[34], as Resolved.
Fri, Dec 1, 3:16 PM · SRE, ops-eqiad, serviceops, DC-Ops
akosiaris added a comment to T271142: Some Service Operations clusters apparently do not support IPv6.

rdb*:
have the AAAA record: rdb[2009-2010]
lack the AAAA record: rdb[1009-1012,2007-2008]

Fri, Dec 1, 3:07 PM · Patch-For-Review, Infrastructure-Foundations, Dumps-Generation, IPv6, serviceops, SRE-tools
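
The "mixed cluster" condition discussed in T271142 (some hosts with AAAA records, some without) can be illustrated with a minimal sketch. The function name `classify_cluster` and the labels it returns are hypothetical, not part of any Wikimedia tooling; the host data is copied from the rdb* comment above, and no DNS lookups are performed.

```python
from typing import Mapping


def classify_cluster(has_aaaa: Mapping[str, bool]) -> str:
    """Classify a cluster from a host -> "has an AAAA record" mapping.

    Returns "ipv6" if every host has an AAAA record, "ipv4-only" if none
    does, and "mixed" otherwise. The labels are illustrative assumptions.
    """
    if not has_aaaa:
        raise ValueError("cluster has no hosts")
    with_record = sum(1 for present in has_aaaa.values() if present)
    if with_record == len(has_aaaa):
        return "ipv6"
    if with_record == 0:
        return "ipv4-only"
    return "mixed"


# Host data as reported in the comment: rdb[2009-2010] have the AAAA
# record, rdb[1009-1012,2007-2008] lack it.
rdb = {f"rdb{n}": True for n in (2009, 2010)}
rdb.update({f"rdb{n}": False for n in (1009, 1010, 1011, 1012, 2007, 2008)})

print(classify_cluster(rdb))
```

With the data above the rdb cluster falls into the mixed category, which is the case the task treats as the actual problem.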

Sep 21 2023

akosiaris added a comment to T346657: Requests originating from zhwiki wikifeeds caused parsoid outage.

I've just re-enabled the filter, rejecting traffic; we are seeing issues with high latencies and decreased availability in the parsoid cluster.

Sep 21 2023, 2:45 PM · MW-1.42-notes (1.42.0-wmf.3; 2023-10-31), Maintenance-Worktype, serviceops, Content-Transform-Team-WIP, Chinese-Sites, RESTBase, Parsoid
akosiaris added a comment to T308339: eqiad: move non WMCS servers out of rack C8.

@RobH, Switchover was done yesterday, we are now in codfw for the next 6 months, deploy1002 is no longer used. It can be powered off and moved whenever ops-eqiad feels like it.

Sep 21 2023, 9:48 AM · SRE, DBA, ops-eqiad
akosiaris added a comment to T346657: Requests originating from zhwiki wikifeeds caused parsoid outage.

I've just disabled the rule. It's still present, but inactive. For other SREs having to re-enable it in an emergency:

Sep 21 2023, 8:24 AM · MW-1.42-notes (1.42.0-wmf.3; 2023-10-31), Maintenance-Worktype, serviceops, Content-Transform-Team-WIP, Chinese-Sites, RESTBase, Parsoid

Sep 20 2023

akosiaris added a comment to T346354: restbase deploys via scap lead to all hosts being disabled in conftool .

We'll schedule a scap deploy for RESTBase, thanks @jnuche

Sep 20 2023, 8:21 AM · Release-Engineering-Team, Scap, serviceops

Sep 18 2023

akosiaris placed T253058: DRY kafka broker declaration in helmfiles up for grabs.
Sep 18 2023, 5:23 PM · Data-Engineering, Data-Platform-SRE, serviceops, SRE, Event-Platform
akosiaris added a comment to T344751: Decide on default histogram buckets for MediaWiki timers.

with overrides being configured within WMF production wiring, as opposed to provided by the software. That imho violates the separation of concerns and wouldn't scale for other MW users to know about and keep in sync across core and hundreds of extension repos, and across major version upgrades.

Sep 18 2023, 4:01 PM · MediaWiki-libs-Stats, serviceops, Observability-Metrics
akosiaris added a comment to T344751: Decide on default histogram buckets for MediaWiki timers.

The range and sizes of buckets in the histogram can be defined per metric (actually per group of metrics, e.g. via a regex). We already use this a lot, e.g. in various services in WikiKube, where each service configures statsd-exporter per metric they want. It is not possible to do this on demand, i.e. to allow the producer to change it on the fly without shipping a config change.

Sep 18 2023, 3:58 PM · MediaWiki-libs-Stats, serviceops, Observability-Metrics
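
The idea of defining buckets per group of metrics via a regex can be sketched as a static lookup table. This is only an illustration of the concept, not statsd-exporter's implementation or configuration format; the metric names, bucket values, and the names `BUCKET_RULES`, `DEFAULT_BUCKETS` and `buckets_for` are all hypothetical.

```python
import re
from typing import Tuple

# Hypothetical fallback buckets (upper bounds in seconds), used when no
# rule matches a metric name.
DEFAULT_BUCKETS: Tuple[float, ...] = (0.01, 0.05, 0.1, 0.5, 1.0, 5.0)

# Hypothetical rules: (regex over the metric name, bucket upper bounds).
# The table is static, so changing buckets means shipping a config change.
BUCKET_RULES = [
    (re.compile(r"^example_service_request_.*_seconds$"),
     (0.005, 0.025, 0.1, 0.25, 1.0)),
    (re.compile(r"^example_job_.*_seconds$"),
     (1.0, 10.0, 60.0, 600.0)),
]


def buckets_for(metric_name: str) -> Tuple[float, ...]:
    """Return the buckets of the first matching rule, else the defaults."""
    for pattern, buckets in BUCKET_RULES:
        if pattern.search(metric_name):
            return buckets
    return DEFAULT_BUCKETS


print(buckets_for("example_job_refresh_links_seconds"))
print(buckets_for("something_else_seconds"))
```

Because the producer never sees this table, it cannot alter bucket boundaries at runtime, which matches the constraint described in the comment.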
akosiaris added a comment to T346579: REST API not returning latest page when queried title is a redirect.

I'll admit I am a bit stumped here. This is clearly not the CDN's fault as RESTBase exhibits the same behavior while also violating what it advertises as the documentation of the API.

Sep 18 2023, 2:58 PM · SRE, Traffic, RESTBase-API, RESTBase

Sep 15 2023

akosiaris closed T346055: FRUP: Add Applepay verification code to donate wiki as Resolved.

https://donate.wikipedia.org/.well-known/apple-developer-merchantid-domain-association now, in my checks, returns the contents of the file in this task. It might take a bit more (up to 30 minutes) to propagate everywhere. I am resolving this task, feel free to reopen to report issues.

Sep 15 2023, 1:21 PM · Fundraising-Backlog, SecTeam-Processed, serviceops
akosiaris added a comment to T342201: MediaWiki\Extension\Notifications\Api\ApiEchoUnreadNotificationPages::getUnreadNotificationPagesFromForeign: Unexpected API response from {wiki}.

I am gonna add one more data point. In all of these errors, the data.servedby stanza refers to an *eqiad* API server. I looked a bit at the distribution of those API servers to see if there is any pattern that would identify one or more specific ones; (thankfully?) that is not the case. Apparently any eqiad API server can appear in this dataset.

Sep 15 2023, 8:19 AM · Growth-Team, serviceops, SRE, MediaWiki-Platform-Team, MW-1.41-notes (1.41.0-wmf.20; 2023-08-01), MediaWiki-extensions-CentralAuth, MW-on-K8s, Notifications, Wikimedia-production-error
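
The kind of check described in the comment (looking at how errors are distributed across the servers named in data.servedby) can be sketched as a simple frequency count. The record layout, the host names, and the function name `servedby_share` are hypothetical; only the `data.servedby` field name comes from the comment.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def servedby_share(records: Iterable[Mapping]) -> Dict[str, float]:
    """Return the fraction of error records attributed to each server.

    Records without a data.servedby value are ignored.
    """
    counts = Counter(
        record["data"]["servedby"]
        for record in records
        if "servedby" in record.get("data", {})
    )
    total = sum(counts.values())
    return {host: count / total for host, count in counts.most_common()}


# Hypothetical sample records; real log entries carry many more fields.
sample = [
    {"data": {"servedby": "api-host-a"}},
    {"data": {"servedby": "api-host-b"}},
    {"data": {"servedby": "api-host-a"}},
    {"data": {"servedby": "api-host-c"}},
    {"message": "record without a data stanza"},
]

for host, share in servedby_share(sample).items():
    print(f"{host}: {share:.0%}")
```

A roughly even spread across hosts, as the comment reports, would suggest that no single server is responsible.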

Sep 14 2023

Restricted Application added a project to T273479: ApiEchoUnreadNotificationPages.php PHP Notice: Undefined index: query: Wikimedia-production-error.

Should we resolve this?

Sep 14 2023, 4:13 PM · Wikimedia-production-error, Growth-Team-Filtering, Growth-Team, Notifications

Sep 13 2023

akosiaris updated the task description for T300152: Investigate Ganeti in routed mode.
Sep 13 2023, 12:59 PM · SRE, netops, Ganeti, Infrastructure-Foundations
akosiaris added a comment to T337649: Thumbor fails to render thumbnails of djvu/tiff/pdf files quite often in eqiad.

I can sense the frustration pretty clearly […]

Oh, if that came across as venting frustration I have to apologize. It was an attempt at a funny way to communicate 1) that Thumbor needs kicking again (as it has needed periodically since the opening of this task), and 2) that tweaking the parameters for parallelism, queue size, and timeouts is probably not going to be sufficient since it keeps recurring (iow, we'll be back here in a couple of weeks, wasting Hnowlan's time with yet another temporary workaround).

Sep 13 2023, 12:53 PM · Patch-For-Review, MW-1.41-notes (1.41.0-wmf.13; 2023-06-13), All-and-every-Wikisource, serviceops, Thumbor
akosiaris added a comment to T337649: Thumbor fails to render thumbnails of djvu/tiff/pdf files quite often in eqiad.

I can sense the frustration pretty clearly and I appreciate the effort to illustrate it via this story, avoiding lashing out. As a data point, users aren't the only ones frustrated with the situation. Engineers (developers, software engineers, SREs) are frustrated (and have been for a long time) too.

Sep 13 2023, 9:22 AM · Patch-For-Review, MW-1.41-notes (1.41.0-wmf.13; 2023-06-13), All-and-every-Wikisource, serviceops, Thumbor

Sep 8 2023

akosiaris created P52338 Triggering apertium call via cxserver on k8s-staging.
Sep 8 2023, 11:53 AM

Sep 7 2023

akosiaris added a comment to T345853: Fail event on /dev/md/0:kubernetes2028.

Yes, it is safe; we haven't put those in production yet.

Sep 7 2023, 3:23 PM · serviceops, SRE
akosiaris added a comment to T329491: ICU transition towards ICU 67.

ICU67 images, built and pushed.

Sep 7 2023, 1:26 PM · Patch-For-Review, serviceops-radar, SRE
akosiaris added a comment to T345794: Mobile HTML endpoint returns an empty response.

Editing the article will indeed issue cache purge events for both the CDN and RESTBase.

Sep 7 2023, 11:19 AM · serviceops-radar, Page Content Service, RESTBase-API
akosiaris triaged T345738: etcd in codfw burned all latency SLO error budget as Medium priority.

Given this isn't urgent and we have multiple ways of dealing with this, I've re-enabled puppet and cadvisor has been started again. Sure enough, the latency has increased again.

Sep 7 2023, 11:18 AM · Patch-For-Review, SRE, Infrastructure-Foundations, serviceops
akosiaris added a comment to T345738: etcd in codfw burned all latency SLO error budget.

Hi

TL;DR

cadvisor is to blame. Adding @fgiunchedi for his information and a thumbs up on disabling cadvisor on conf2* until we can bump their kernel version.

I'm ok to disable cadvisor there, though I gotta ask what are the plans for conf2* upgrades and/or reboots?

Sep 7 2023, 8:59 AM · Patch-For-Review, SRE, Infrastructure-Foundations, serviceops
akosiaris updated subscribers of T345738: etcd in codfw burned all latency SLO error budget.

There's a few more actionables here:

Sep 7 2023, 8:45 AM · Patch-For-Review, SRE, Infrastructure-Foundations, serviceops
akosiaris added a comment to T345738: etcd in codfw burned all latency SLO error budget.

This isn't present in conf1* hosts, despite also running cadvisor and the same exact version, presumably because of a different kernel version. conf1* hosts are bullseye and conf2* hosts are buster. 5.10.0-15-amd64 vs 4.19.0-20-amd64

Sep 7 2023, 8:42 AM · Patch-For-Review, SRE, Infrastructure-Foundations, serviceops
akosiaris updated subscribers of T345738: etcd in codfw burned all latency SLO error budget.

cadvisor is to blame. Adding @fgiunchedi for his information and a thumbs up on disabling cadvisor on conf2* until we can bump their kernel version.

Sep 7 2023, 8:40 AM · Patch-For-Review, SRE, Infrastructure-Foundations, serviceops
akosiaris edited projects for T345794: Mobile HTML endpoint returns an empty response, added: serviceops-radar; removed serviceops.

I can reproduce internally, this has nothing to do with the CDN, it looks more like a RESTBase or a PCS service issue.

Sep 7 2023, 7:31 AM · serviceops-radar, Page Content Service, RESTBase-API

Sep 6 2023

akosiaris added a comment to T345738: etcd in codfw burned all latency SLO error budget.

So, ethtool -G eno1 rx 1000 apparently did the trick

Sep 6 2023, 3:35 PM · Patch-For-Review, SRE, Infrastructure-Foundations, serviceops

Sep 5 2023

akosiaris added a comment to T338471: Replace the current recommendation-api service with a newer version.

Adding a data point that just crossed my mind, just to rule it out.

Sep 5 2023, 11:37 AM · Machine-Learning-Team, serviceops

Sep 4 2023

akosiaris added a comment to T329491: ICU transition towards ICU 67.

I've uploaded changes for icu67 php7.4 images for use with a shellbox deployment. I'll also create a temporary shellbox deployment based on those.

Sep 4 2023, 2:59 PM · Patch-For-Review, serviceops-radar, SRE
akosiaris awarded T345265: CommRel support for September 2023 Datacenter Switchover a Love token.
Sep 4 2023, 2:40 PM · CommRel-Specialists-Support (Jul-Sep-2023), serviceops, Datacenter-Switchover, SRE
akosiaris updated subscribers of T345561: Upgrade the MediaWiki servers to ICU 67.
Sep 4 2023, 12:14 PM · serviceops
akosiaris added a comment to T345290: Deploy a more recent version of Mathoid to production than 2023-02-21.

I've tried and did manage to find an owner in the past, but we are back to square 1 regarding this. I just hope that the plans to fully migrate math rendering in the browser will materialize and we'll be able to undeploy this service at some point.

It will take a while. While we will be ready for opt-in very soon (maybe even this month), it will be a long way to fix all issues. Convincing people that a browser is not a TeX engine will be even harder, and advocating for the advantages of inherent math display will take some time as well.

Sep 4 2023, 12:07 PM · Math, Mathoid
akosiaris moved T345561: Upgrade the MediaWiki servers to ICU 67 from Incoming 🐫 to Doing 😎 on the serviceops board.
Sep 4 2023, 11:35 AM · serviceops
akosiaris updated the task description for T345561: Upgrade the MediaWiki servers to ICU 67.
Sep 4 2023, 11:34 AM · serviceops
akosiaris created T345561: Upgrade the MediaWiki servers to ICU 67.
Sep 4 2023, 11:34 AM · serviceops
akosiaris closed T345290: Deploy a more recent version of Mathoid to production than 2023-02-21 as Resolved.

I've merged and deployed the change mentioned in T344747, alongside the dependent changes. The curl call listed on the Mathoid page successfully returns an SVG, and the openapi checks are apparently also fine. I'll resolve this one but I'll note that for

Sep 4 2023, 9:52 AM · Math, Mathoid
akosiaris added a comment to T345290: Deploy a more recent version of Mathoid to production than 2023-02-21.

I think that the subject is misleading. Per https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/33520a1a4409a9b0cef71a0b4baba148f64f2d40 (gerrit change is https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/809194) deployed mathoid version isn't 2023-02-21 but 2022-06-28-144716-production (which is even older).

Sep 4 2023, 9:31 AM · Math, Mathoid

Sep 1 2023

akosiaris added a comment to T341121: Decommission dbproxy10[12-17].

@Ladsgroup @Marostegui, dependent T340843 is now resolved, you can proceed with the decom process. Thanks and sorry for waiting so long.

Sep 1 2023, 1:02 PM · DBA
akosiaris moved T340843: WikiKube: Investigate how to abstract misc Mariadb clusters host/ip information so that no deployment of apps is needed when a master is failed over from Doing 😎 to ⎈Kubernetes on the serviceops board.
Sep 1 2023, 1:01 PM · Patch-For-Review, Data-Persistence, serviceops
akosiaris closed T340843: WikiKube: Investigate how to abstract misc Mariadb clusters host/ip information so that no deployment of apps is needed when a master is failed over as Resolved.

ipoid and toolhub done today. Resolving this. We've chosen a path forward, implemented it and migrated the services that utilized hardcoded dbproxies networking rules to the new thing.

Sep 1 2023, 1:01 PM · Patch-For-Review, Data-Persistence, serviceops
akosiaris updated the task description for T300033: Use cert-manager for service-proxy certificate creation.
Sep 1 2023, 12:59 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
akosiaris added a comment to T331699: Migrate the r/w LDAP servers to Bookworm and MDB storage.

One other option would be to simply start with a fresh, parallel setup and skip Bullseye entirely:

  • Create new ldap-rw1001/ldap-rw2001 VMs using Bookworm and set profile::openldap::storage_backend to "mdb" and configure them as a synchronisation pair
  • slapcat the existing data from serpens to an LDIF (ACLs, LDAP extensions are all distributed via Puppet)
  • slapadd the LDIF on ldap-rw1001 and let it sync towards ldap-rw2001
  • Create four additional ldap-replica VMs running Bookworm and sync them against ldap-rw1001/2001
  • Test the new setup
  • When everything works as expected in the parallel setup, revert the new Bookworm hosts to a clean state
  • Setup a window (1-2 hours) during which no r/w changes are possible (disable Bitu temporarily, tell SREs to avoid LDAP changes, disable Horizon)
  • Repeat the same import as above with current data, if all is well:
  • Point ldap-rw.codfw.w.o to ldap-rw2001
  • Point ldap-rw.eqiad.w.o to ldap-rw1001
  • Depool all older readonly replicas in favour of the new bookworm ones
  • If there are unforeseen issues we can simply revert to serpens/seaborgium/old replicas
  • If all is well, decom serpens/seaborgium and the old replicas
Sep 1 2023, 12:20 PM · Patch-For-Review, LDAP, Infrastructure-Foundations, SRE

Aug 30 2023

akosiaris updated the task description for T300033: Use cert-manager for service-proxy certificate creation.
Aug 30 2023, 2:51 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
akosiaris updated the task description for T340843: WikiKube: Investigate how to abstract misc Mariadb clusters host/ip information so that no deployment of apps is needed when a master is failed over.
Aug 30 2023, 1:54 PM · Patch-For-Review, Data-Persistence, serviceops
akosiaris added a comment to T340843: WikiKube: Investigate how to abstract misc Mariadb clusters host/ip information so that no deployment of apps is needed when a master is failed over.

linkrecommendation done today.

Aug 30 2023, 1:50 PM · Patch-For-Review, Data-Persistence, serviceops
akosiaris updated subscribers of T340111: Configure bgp-error-tolerance on Juniper routers.

FYI, same mitigation applies to https://supportportal.juniper.net/s/article/2023-08-29-Out-of-Cycle-Security-Bulletin-Junos-OS-and-Junos-OS-Evolved-A-crafted-BGP-UPDATE-message-allows-a-remote-attacker-to-de-peer-reset-BGP-sessions-CVE-2023-4481?language=en_US, released yesterday

Aug 30 2023, 5:40 AM · SRE, Infrastructure-Foundations, netops

Aug 29 2023

akosiaris added a comment to T344751: Decide on default histogram buckets for MediaWiki timers.

Those numbers are for summary quantiles, not histogram buckets. Summaries aren't aggregatable and in almost all cases regarding multi-instance (e.g. multiple servers, multiple pods) metrics you don't want to deal with them. Almost all efforts to run aggregation queries over them will produce wrong results.

Aug 29 2023, 7:14 PM · MediaWiki-libs-Stats, serviceops, Observability-Metrics
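
The point that summaries aren't aggregatable while histograms are can be shown with a small, self-contained sketch. The bucket boundaries, the sample latencies, and all function names are hypothetical. Unlike Prometheus' own `histogram_quantile`, which interpolates within a bucket, this simplified version just returns the upper bound of the bucket containing the quantile.

```python
from bisect import bisect_left
from math import ceil

BUCKETS = (0.1, 0.5, 1.0, 5.0)  # hypothetical upper bounds, in seconds


def exact_quantile(samples, q):
    """Nearest-rank quantile over raw samples (what a summary reports)."""
    ordered = sorted(samples)
    rank = max(1, ceil(q * len(ordered)))
    return ordered[rank - 1]


def to_histogram(samples):
    """Per-bucket counts (non-cumulative); the last slot is the overflow."""
    counts = [0] * (len(BUCKETS) + 1)
    for sample in samples:
        counts[bisect_left(BUCKETS, sample)] += 1
    return counts


def histogram_quantile(counts, q):
    """Upper bound of the bucket that contains the q-th quantile."""
    target = q * sum(counts)
    seen = 0
    for upper, count in zip(BUCKETS + (float("inf"),), counts):
        seen += count
        if seen >= target:
            return upper
    return float("inf")


# Two hypothetical instances: a busy fast one and a quiet slow one.
server_a = [0.05] * 990
server_b = [2.0] * 10

# Ground truth over all requests.
true_p95 = exact_quantile(server_a + server_b, 0.95)

# Summaries: each instance reports its own p95; averaging them is wrong.
averaged_p95 = (exact_quantile(server_a, 0.95)
                + exact_quantile(server_b, 0.95)) / 2

# Histograms: bucket counts can simply be summed across instances.
merged = [a + b for a, b in zip(to_histogram(server_a), to_histogram(server_b))]
merged_p95 = histogram_quantile(merged, 0.95)

print(true_p95, averaged_p95, merged_p95)
```

In this example the averaged per-instance quantile lands far from the true value because the quiet instance is weighted the same as the busy one, while the merged histogram stays within the bucket that contains the true p95.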
akosiaris added a comment to T340843: WikiKube: Investigate how to abstract misc Mariadb clusters host/ip information so that no deployment of apps is needed when a master is failed over.

Patch was merged and deployed for cxserver. Things went ok after a couple of road bumps and corresponding brown paper bag fixes. I'll be deploying the changes to ipoid, toolhub and linkrecommendation in the next few days and remove the hardcoded references to dbproxies from those deployments.

Aug 29 2023, 4:38 PM · Patch-For-Review, Data-Persistence, serviceops
akosiaris closed T341117: cxserver: Section Mapping Database (m5) not accessible by certain region as Resolved.

Fix merged and deployed. Some hiccups aside, it works fine across all 3 environments (staging, production eqiad, production codfw). I'll resolve this one.

Aug 29 2023, 4:37 PM · Language-Team, serviceops, Kubernetes, CX-cxserver
akosiaris closed T341117: cxserver: Section Mapping Database (m5) not accessible by certain region, a subtask of T340843: WikiKube: Investigate how to abstract misc Mariadb clusters host/ip information so that no deployment of apps is needed when a master is failed over, as Resolved.
Aug 29 2023, 4:36 PM · Patch-For-Review, Data-Persistence, serviceops
akosiaris updated the task description for T341121: Decommission dbproxy10[12-17].
Aug 29 2023, 4:36 PM · DBA
akosiaris changed the status of T340843: WikiKube: Investigate how to abstract misc Mariadb clusters host/ip information so that no deployment of apps is needed when a master is failed over from Open to In Progress.
Aug 29 2023, 2:48 PM · Patch-For-Review, Data-Persistence, serviceops

Aug 28 2023

akosiaris added a comment to T340843: WikiKube: Investigate how to abstract misc Mariadb clusters host/ip information so that no deployment of apps is needed when a master is failed over.

Hi all, is there any update on this please? linkrecommendation service had a second incident today caused by the same problem. I've updated the network policies again, but it'd be great to implement one of the solutions described in this task soon-ish to prevent future outages.

Aug 28 2023, 1:39 PM · Patch-For-Review, Data-Persistence, serviceops

Aug 22 2023

akosiaris added a comment to T340036: Setup allowed list for MCS decom.

Hi there! I'm responsible for the Kiwix migration to another API, but given the discussion above I'm curious whether you have plans to add MWOffliner to the allowed list to get access to /mobile-sections. And if so, how long will it keep working? I assume that even though the MWOffliner User-Agent was added earlier, MCS is already completely disabled, because I got a 403 error page for this curl request:

curl -H "User-Agent: MWoffliner/1.13.0 (contact@kiwix.org)" https://en.wikipedia.org/api/rest_v1/page/mobile-sections

Aug 22 2023, 8:41 AM · affects-Kiwix-and-openZIM, Content-Transform-Team-WIP, RESTBase Sunsetting, SRE, serviceops, Traffic, Mobile-Content-Service

Jul 27 2023

akosiaris moved T290536: Serve production traffic via Kubernetes from Backlog to In Progress on the MW-on-K8s board.
Jul 27 2023, 8:57 AM · Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
akosiaris moved T342748: mw-on-k8s app container CPU throttling at low average load from Backlog to In Progress on the MW-on-K8s board.
Jul 27 2023, 8:51 AM · serviceops, MW-on-K8s
akosiaris moved T341859: Move noc.wikimedia.org to kubernetes from Backlog to In Progress on the MW-on-K8s board.
Jul 27 2023, 8:51 AM · noc.wikimedia.org, serviceops, MW-on-K8s

Jul 26 2023

akosiaris added a comment to T341122: Implement daily data update routine.

Those numbers don't immediately raise alarm bells for me -- "storage" doesn't mean anything persistent, only ephemeral data that can disappear when the script exits, right? As long as that's the case (and assuming you're using ~1 CPU), you should be fine. I'm tagging in @akosiaris to confirm the resource request is sensible.

Jul 26 2023, 4:18 PM · Trust and Safety Product Sprint (Sprint Bodhrán), Patch-For-Review, Anti-Harassment (AHaT Sprint 32 - Baseball Cap), iPoid-Service
akosiaris awarded T342250: Alert triage: KubeletOperationalLatency a Love token.
Jul 26 2023, 2:32 PM · sre-alert-triage, Prod-Kubernetes, Kubernetes, serviceops
akosiaris added a comment to T342250: Alert triage: KubeletOperationalLatency.

Increase the threshold of the alert from 1s to 2s (or 1.5) as I'm not aware of any issues arising from this

Jul 26 2023, 10:25 AM · sre-alert-triage, Prod-Kubernetes, Kubernetes, serviceops
akosiaris closed T243858: Request to block ActionApi client (based on a specific user agent header) as Declined.

I am gonna close this as declined. While we do have the ability to block requests based on user-agent, we don't do that on request.

Jul 26 2023, 8:42 AM · serviceops, SRE
akosiaris added a comment to T297314: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch).

I've gone ahead and populated the Saturation panels. Traffic, Errors and Latencies will need more work, but I will not be able to help with that anytime soon.

Jul 26 2023, 8:36 AM · Abstract Wikipedia team, serviceops, Service-deployment-requests, Services, SRE
akosiaris added a comment to T297314: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch).

I've gone ahead and created https://grafana.wikimedia.org/d/FEkiKFqVk/wikifunctions?orgId=1

Jul 26 2023, 8:13 AM · Abstract Wikipedia team, serviceops, Service-deployment-requests, Services, SRE
akosiaris closed T340087: Deploy wikidiff2 1.14.1 as Resolved.

php7.4-fpm-multiversion-base rebuilt as well, should make it out to mw-on-k8s in the next deployments. I think we can resolve this now. Feel free to reopen.

Jul 26 2023, 8:11 AM · Community-Tech (CommTech-Kanban), Better-Diffs-2023, wikidiff2, serviceops
akosiaris updated the task description for T297314: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch).
Jul 26 2023, 7:46 AM · Abstract Wikipedia team, serviceops, Service-deployment-requests, Services, SRE

Jul 25 2023

akosiaris closed T326785: Kubernetes Wikifunctions security and control measures as Resolved.

The apparmor changes have been merged. I think the goal of this task is done. I'll resolve, but feel free to reopen.

Jul 25 2023, 12:21 PM · Abstract Wikipedia team (Phase λ – Launch), Patch-For-Review, serviceops
akosiaris closed T326785: Kubernetes Wikifunctions security and control measures, a subtask of T282913: Implement agreed security and control measures in the Wikifunctions system, as Resolved.
Jul 25 2023, 12:20 PM · Abstract Wikipedia team (Phase λ – Launch), Epic
akosiaris closed T326785: Kubernetes Wikifunctions security and control measures, a subtask of T313226: Get all SRE-type things ready for launching Wikifunctions, as Resolved.
Jul 25 2023, 12:20 PM · Epic, Abstract Wikipedia team (Phase λ – Launch)
akosiaris added a comment to T326785: Kubernetes Wikifunctions security and control measures.
akosiaris@kubernetes1007:~$ sudo apparmor_status 
apparmor module is loaded.
10 profiles are loaded.
10 profiles are in enforce mode.
   /usr/bin/man
   docker-default
   lsb_release
   man_filter
   man_groff
   nvidia_modprobe
   nvidia_modprobe//kmod
   tcpdump
   wikifunctions-evaluator
   wikifunctions-orchestrator
<snip>
Jul 25 2023, 11:39 AM · Abstract Wikipedia team (Phase λ – Launch), Patch-For-Review, serviceops

Jul 24 2023

akosiaris closed T342085: Increase to >3s for parsoid average get/200 latency since 2023-7-15 12:30 as Resolved.

This apparently stopped happening yesterday, the 23rd of July, at ~9:00am

Jul 24 2023, 5:00 PM · Parsoid (Tracking), serviceops
akosiaris added a comment to T340087: Deploy wikidiff2 1.14.1.

OK, scheduling for tomorrow then, https://wikitech.wikimedia.org/wiki/Deployments#Tuesday,_July_25.

Jul 24 2023, 3:58 PM · Community-Tech (CommTech-Kanban), Better-Diffs-2023, wikidiff2, serviceops
akosiaris added a comment to T288198: Pushes to docker-registry fail for images with compressed layers of size >1GB.

Another case of failing to push a large image: T342084. Is it possible to configure NGINX to use a different (large, on-disk) storage area for certain URLs?

Use case from ML - we are porting the recommendation-api Python service from wmf-cloud to k8s. It uses a dict file containing embeddings (basically a serialized blob) that weighs around 3GB (so one big layer when we add it) and that is loaded when the application bootstraps. We are working with Research to figure out how to handle cases like this one, since "embeddings" could take multiple shapes and sizes, and they'd need to have flexibility when they experiment (like, using a serialized blob as starter and then think about some specialized datastore). This is not an "experimentation" use case of course, but the recommendation-api's setting may vary in the future. We could think about other options if raising the tmpfs mount is not an option (like fetching from swift or similar).

Jul 24 2023, 9:44 AM · Release Pipeline, MW-on-K8s, serviceops
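
Since T288198 concerns pushes failing for compressed layers larger than 1GB, a pre-push check over an image manifest can be sketched as follows. It relies only on the `layers`, `size` and `digest` fields of the Docker image manifest (schema 2); treating 1GB as 1024**3 bytes is an assumption, and the function name and sample digests are hypothetical.

```python
from typing import List, Mapping

# Assumption: the ">1GB" limit from the task title, read as 1 GiB.
LIMIT_BYTES = 1024 ** 3


def oversized_layers(manifest: Mapping, limit: int = LIMIT_BYTES) -> List[str]:
    """Return the digests of layers whose compressed size exceeds limit."""
    return [
        layer["digest"]
        for layer in manifest.get("layers", [])
        if layer["size"] > limit
    ]


# Hypothetical manifest excerpt: one small layer and one ~3GB layer, like
# the embeddings blob described in the comment above.
manifest = {
    "schemaVersion": 2,
    "layers": [
        {"digest": "sha256:aaaa", "size": 50 * 1024 ** 2},
        {"digest": "sha256:bbbb", "size": 3 * 1024 ** 3},
    ],
}

print(oversized_layers(manifest))
```

Running such a check before pushing would flag the large layer early, although it does nothing to lift the limit itself.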
akosiaris added a comment to T288198: Pushes to docker-registry fail for images with compressed layers of size >1GB.

Another case of failing to push a large image: T342084. Is it possible to configure NGINX to use a different (large, on-disk) storage area for certain URLs?

Jul 24 2023, 9:26 AM · Release Pipeline, MW-on-K8s, serviceops

Jul 19 2023

akosiaris updated subscribers of T308339: eqiad: move non WMCS servers out of rack C8.

@RobH mw hosts are 3 api servers and 3 appservers. You can do them anytime. Also, all it requires is a downtime and a poweroff per the description.

Jul 19 2023, 1:35 PM · SRE, DBA, ops-eqiad
akosiaris added a comment to T340087: Deploy wikidiff2 1.14.1.

[...]
@TheresNoTime let me know when we should proceed with the next step of the deployment. Which should be across the whole fleet, unless you have a different idea.

Just to check, does the deployment to the whole fleet require a deployment window? It's sat on the canaries with no issues for about a week now, so we could probably start thinking about that full deployment..

Jul 19 2023, 12:48 PM · Community-Tech (CommTech-Kanban), Better-Diffs-2023, wikidiff2, serviceops

Jul 18 2023

akosiaris moved T342085: Increase to >3s for parsoid average get/200 latency since 2023-7-15 12:30 from Incoming 🐫 to Doing 😎 on the serviceops board.
Jul 18 2023, 12:19 PM · Parsoid (Tracking), serviceops
akosiaris added a comment to T340935: Some apache access logs are invalid json .

A search says: https://svn.apache.org/viewvc/httpd/httpd/trunk/modules/loggers/mod_log_config.c?r1=98912&r2=98911&pathrev=98912

Jul 18 2023, 10:39 AM · Observability-Logging, serviceops, MW-on-K8s
akosiaris added a comment to T340935: Some apache access logs are invalid json .

Just found something more important than User-Agent unfortunately

Jul 18 2023, 10:06 AM · Observability-Logging, serviceops, MW-on-K8s
akosiaris added a comment to T342085: Increase to >3s for parsoid average get/200 latency since 2023-7-15 12:30.

This is apparently due to transclusions per https://grafana.wikimedia.org/d/000300/change-propagation?orgId=1&from=1689418150653&to=1689449568835&viewPanel=27

Jul 18 2023, 9:21 AM · Parsoid (Tracking), serviceops
akosiaris updated subscribers of T341625: Requesting permission to use kafka-main cluster to transport CirrusSearch updates.
Jul 18 2023, 7:06 AM · Discovery-Search (Current work), serviceops, Data-Platform-SRE

Jul 17 2023

akosiaris edited projects for T341625: Requesting permission to use kafka-main cluster to transport CirrusSearch updates, added: serviceops; removed serviceops-radar.
Jul 17 2023, 3:14 PM · Discovery-Search (Current work), serviceops, Data-Platform-SRE
akosiaris added a comment to T318695: Future of Thumbor's memcached backend.

@hnowlan @jijiki. nutcracker removal merged and deployed. I am gonna let you have the pleasure of resolving this task :-)

Jul 17 2023, 2:16 PM · Patch-For-Review, Platform Team Workboards (Platform Engineering Reliability), Thumbor Migration, SRE, serviceops, Thumbor
akosiaris updated the task description for T318695: Future of Thumbor's memcached backend.
Jul 17 2023, 2:14 PM · Patch-For-Review, Platform Team Workboards (Platform Engineering Reliability), Thumbor Migration, SRE, serviceops, Thumbor

Jul 12 2023

akosiaris added a comment to T339102: Allow Wikimedia Maps usage on vikidia.org.

@akosiaris as one of the maintainers of the Maps stack I'll leave my approval and ping @SLopes-WMF to complete as CTT manager.

Jul 12 2023, 1:10 PM · Patch-For-Review, serviceops-radar, Maps
akosiaris added a comment to T340087: Deploy wikidiff2 1.14.1.

JFTR; I've also rebuilt/uploaded wikidiff2 1.14.1 for component/icu67 (so that we don't regress when the ICU67 migration starts)

Jul 12 2023, 12:56 PM · Community-Tech (CommTech-Kanban), Better-Diffs-2023, wikidiff2, serviceops
akosiaris claimed T340087: Deploy wikidiff2 1.14.1.

Looking quickly at mw-canaries and mwdebug, they all have 1.13.0-1+wmf1+buster1

Jul 12 2023, 12:29 PM · Community-Tech (CommTech-Kanban), Better-Diffs-2023, wikidiff2, serviceops

Jul 11 2023

akosiaris added a comment to T338357: Pushing jobs to jobqueue is slow again.

Can I just say that this is pretty awesome? Especially the max latencies for kafka are pretty telling. Keep up the good work on this one!

Jul 11 2023, 8:43 AM · ChangeProp, WMF-JobQueue

Jul 10 2023

akosiaris added a comment to T340036: Setup allowed list for MCS decom.

Wikiwand/0.1 (https://www.wikiwand.com; admin@wikiwand.com) added to the list of user-agents. Please advise if it doesn't work, otherwise please resolve.

Jul 10 2023, 4:52 PM · affects-Kiwix-and-openZIM, Content-Transform-Team-WIP, RESTBase Sunsetting, SRE, serviceops, Traffic, Mobile-Content-Service
akosiaris updated the task description for T341468: Migrate SRE repositories to GitLab.
Jul 10 2023, 4:42 PM · GitLab (Project Migration), collaboration-services
akosiaris updated the task description for T341468: Migrate SRE repositories to GitLab.
Jul 10 2023, 4:42 PM · GitLab (Project Migration), collaboration-services
akosiaris edited projects for T255132: Better handling of memcached service, added: Infrastructure-Foundations; removed serviceops, SRE.

@Dzahn, Judging from the content of the task, this is for Infrastructure-Foundations, not serviceops, retagging.

Jul 10 2023, 3:55 PM · CAS-SSO, Infrastructure-Foundations

Jul 7 2023

akosiaris added a comment to T338471: Replace the current recommendation-api service with a newer version.

Just to add my 2 cents as a generic observation.

Jul 7 2023, 2:45 PM · Machine-Learning-Team, serviceops
akosiaris closed T258697: stop using $::site in description field of service.yaml as Resolved.

PCC at https://puppet-compiler.wmflabs.org/output/936062/42341/ says 0 diff for alert hosts, lvs hosts see a comment change in configuration and there arguably the .discovery.wmnet approach is better anyway informationally. The alerting part of this task doesn't apply anymore anyway, I am gonna resolve this.

Jul 7 2023, 11:13 AM · serviceops, observability, SRE