akosiaris (Alexandros Kosiaris)
Site Reliability Engineer · Administrator

User Details

User Since: Oct 3 2014, 8:40 AM (345 w, 4 d)
Roles: Administrator
Availability: Available
IRC Nick: akosiaris
LDAP User: Alexandros Kosiaris
MediaWiki User: AKosiaris (WMF)

Recent Activity

Yesterday

akosiaris added a comment to T238909: Proposal: simplify set up of a new load-balanced service on kubernetes.

PR is at https://github.com/projectcalico/confd/pull/515, waiting for review now. It's been tested locally in a couple of bird containers doing a full mesh with each other.

Tue, May 18, 3:43 PM · SRE, Prod-Kubernetes, Pybal, Traffic, serviceops

Mon, May 17

akosiaris added a comment to T120242: Consistent MediaWiki state change events | MediaWiki events as source of truth.

Do we have estimations (or even better hard data) as to the number of missed events?

See T215001: Revisions missing from mediawiki_revision_create. A patch by Clara and Petr is going out with the wmf.5 train this week that may mitigate the majority of missing events.

Mon, May 17, 1:36 PM · DBA, WMF-Architecture-Team, Platform Team Legacy (Later), Analytics, Event-Platform, Services (later)

Fri, May 14

akosiaris added a comment to T279411: Determine why service responses are slow and what we can do about it.

There was a big rise in (measured) p99 latency yesterday:
https://grafana.wikimedia.org/d/CI6JRnLMz/linkrecommendation?viewPanel=122&orgId=1&from=1620773010687&to=1621000590069


So it seems like fixing the bucket issue from T279411#7065251 worked, and the real p99 latency is around 10s? (Which is a lot better than what I expected, TBH.)
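
The comment above does not spell out what the bucket issue was. Assuming it concerned histogram buckets whose largest finite bound sat below the real latencies, the sketch below shows why a measured p99 can jump once the buckets are widened. The function name and the sample counts are hypothetical. The estimator interpolates linearly within a bucket and, when the quantile lands in the +Inf bucket, reports the largest finite bound, which is how Prometheus-style histogram quantiles behave.

```python
import math

def estimate_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    buckets must be sorted by upper bound and end with (math.inf, total).
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies beyond the largest finite bucket, so the
                # best available answer is that bucket's upper bound.
                return prev_bound
            if count == prev_count:
                return bound
            # Linear interpolation inside the matching bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound

# Hypothetical data: the same 1000 requests seen through two bucket layouts.
narrow = [(0.5, 400), (1.0, 600), (2.5, 700), (math.inf, 1000)]
wide = [(0.5, 400), (1.0, 600), (2.5, 700), (5.0, 900),
        (10.0, 995), (30.0, 1000), (math.inf, 1000)]

print(estimate_quantile(0.99, narrow))  # saturates at the largest finite bound, 2.5
print(estimate_quantile(0.99, wide))    # lands between 5 and 10 seconds
```

With the narrow layout the estimate cannot exceed 2.5 s no matter how slow the tail is, so widening the buckets makes the reported p99 rise even though the service itself has not changed.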

Fri, May 14, 2:21 PM · Patch-For-Review, Growth-Team (Current Sprint), serviceops, Data-Persistence (Consultation), Add-Link
akosiaris added a comment to T120242: Consistent MediaWiki state change events | MediaWiki events as source of truth.

Drive by comments by yours truly:

Fri, May 14, 11:01 AM · DBA, WMF-Architecture-Team, Platform Team Legacy (Later), Analytics, Event-Platform, Services (later)
akosiaris added a comment to T282824: MW container image build workflow vs docker-registry caching.

The single-version image will be tagged with the train branch (e.g. wmf-1.37.0-wmf.4) and pushed to the registry, probably updating an existing tag.

Fri, May 14, 8:31 AM · serviceops, MW-on-K8s

Thu, May 13

akosiaris added a comment to T238909: Proposal: simplify set up of a new load-balanced service on kubernetes.

Upstream calico issue at https://github.com/projectcalico/calico/issues/4607

Thu, May 13, 3:36 PM · SRE, Prod-Kubernetes, Pybal, Traffic, serviceops

Wed, May 12

akosiaris added a comment to T279411: Determine why service responses are slow and what we can do about it.

@akosiaris We seem to have gone over the max memory usage limit again

although the average looks OK.

Not really, but it's extremely close, meaning that one of the pods may not be getting more memory. That usually leads to extended GC cycles manifesting as high CPU usage, but the avg/max CPU diagrams do not show anything like that. Since I also don't see any sign of that causing a problem in the rest of the metrics (errors, latencies), I am inclined to say it's OK. Does that sound reasonable to you?

I see. Yeah that sounds fine. I see 3 "WORKER TIMEOUT" events in the logs for the service in the last 24 hours, which is more than we've seen recently, but also nothing to worry about IMO.

Wed, May 12, 2:17 PM · Patch-For-Review, Growth-Team (Current Sprint), serviceops, Data-Persistence (Consultation), Add-Link
akosiaris added a comment to T279411: Determine why service responses are slow and what we can do about it.

@akosiaris We seem to have gone over the max memory usage limit again

although the average looks OK.

Wed, May 12, 1:52 PM · Patch-For-Review, Growth-Team (Current Sprint), serviceops, Data-Persistence (Consultation), Add-Link

Thu, May 6

akosiaris added a comment to T279411: Determine why service responses are slow and what we can do about it.

I'll admit that, lacking a clear SLO, I am not sure what that translates to for the service itself. Since it's being called by a maintenance script only (is that still true? I guess so, but double-checking), there should be no user-visible consequence, which is a plus. Maintenance script execution times must have increased, but that might be fully OK right now.

Yeah, it is still called from a maintenance script, the only difference is that there are now five production wikis instead of one, and one instance of the script is running for each, so more parallel requests.

Thu, May 6, 1:06 PM · Patch-For-Review, Growth-Team (Current Sprint), serviceops, Data-Persistence (Consultation), Add-Link
akosiaris added a comment to T279411: Determine why service responses are slow and what we can do about it.

Post deployment all 4 metrics (cpu/memory avg/maxes) look quite a bit better

Thu, May 6, 10:18 AM · Patch-For-Review, Growth-Team (Current Sprint), serviceops, Data-Persistence (Consultation), Add-Link
akosiaris added a comment to T279411: Determine why service responses are slow and what we can do about it.

Another snapshot from today:

It looks like we're getting pretty close to the memory limit. I'm also unclear what happened from 06:00 to 12:30 on 2021-05-04.

Nothing unusual seen in logstash.

Thu, May 6, 9:29 AM · Patch-For-Review, Growth-Team (Current Sprint), serviceops, Data-Persistence (Consultation), Add-Link

Wed, May 5

akosiaris added a comment to T40010: RFC: Re-evaluate librsvg as SVG renderer on Wikimedia wikis.

OK, sorry for trying to clarify and help. Got the message, will keep out of the discussion going forward again.

Wed, May 5, 11:29 AM · TechCom-RFC, MediaWiki-File-management, Commons, Multimedia, Wikimedia-SVG-rendering

Wed, Apr 28

akosiaris added a comment to T274463: Backups for GitLab.

CCing @akosiaris for rationale for hourly backups (see recent comments) as committer of the original schedule, in case he can provide extra background/requirements of past recoveries.

Wed, Apr 28, 9:01 AM · Data-Persistence-Backup, Patch-For-Review, User-brennen, GitLab (Initialization)

Tue, Apr 27

akosiaris changed the status of T273140: decommission rdb200[3456].codfw.wmnet from Stalled to Open.
Tue, Apr 27, 3:40 PM · SRE, ops-codfw, DC-Ops, serviceops, decommission-hardware
akosiaris renamed T273140: decommission rdb200[3456].codfw.wmnet from decommission rdb200[34].codfw.wmnet to decommission rdb200[3456].codfw.wmnet.
Tue, Apr 27, 3:40 PM · SRE, ops-codfw, DC-Ops, serviceops, decommission-hardware
akosiaris closed T255681: Put rdb200[78] into service, a subtask of T251626: (Need By: TDB) rack/setup/install rdb200[78], as Resolved.
Tue, Apr 27, 3:38 PM · SRE, ops-codfw, DC-Ops
akosiaris closed T255681: Put rdb200[78] into service as Resolved.

Resolving. The child tasks are done; we are moving to decommissioning now.

Tue, Apr 27, 3:38 PM · serviceops, User-jijiki, SRE
akosiaris closed T255250: Replace rdb200[34] with rdb200[78] as Resolved.

Servers switched over, fully functional now.

Tue, Apr 27, 3:37 PM · Patch-For-Review, serviceops
akosiaris closed T255250: Replace rdb200[34] with rdb200[78], a subtask of T273140: decommission rdb200[3456].codfw.wmnet, as Resolved.
Tue, Apr 27, 3:37 PM · SRE, ops-codfw, DC-Ops, serviceops, decommission-hardware
akosiaris closed T281216: Replace rdb2005, rdb2006 with rdb2009, rdb2010 as Resolved.

Done. All apps migrated, replication broken, now rdb2009 and rdb2010 are the canonical ones to use

Tue, Apr 27, 3:34 PM · SRE, serviceops
akosiaris updated the task description for T255250: Replace rdb200[34] with rdb200[78].
Tue, Apr 27, 3:33 PM · Patch-For-Review, serviceops
akosiaris created T281225: Put rdb20[09|10] into service.
Tue, Apr 27, 8:48 AM · SRE, serviceops
akosiaris changed the status of T273140: decommission rdb200[3456].codfw.wmnet from Open to Stalled.

Stalling until T255250 is completed.

Tue, Apr 27, 8:20 AM · SRE, ops-codfw, DC-Ops, serviceops, decommission-hardware
akosiaris added a parent task for T255250: Replace rdb200[34] with rdb200[78]: T273140: decommission rdb200[3456].codfw.wmnet.
Tue, Apr 27, 8:19 AM · Patch-For-Review, serviceops
akosiaris added a subtask for T273140: decommission rdb200[3456].codfw.wmnet: T255250: Replace rdb200[34] with rdb200[78].
Tue, Apr 27, 8:19 AM · SRE, ops-codfw, DC-Ops, serviceops, decommission-hardware

Mon, Apr 26

akosiaris added a comment to T280718: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practive.

Reading the answers (thanks!!!) I understand that

Mon, Apr 26, 4:12 PM · Patch-For-Review, serviceops-radar, SRE, Wikimedia-SVG-rendering
akosiaris closed T280967: Requesting access to Wikimedia Analytics Data for Aisha Khatun as Resolved.

@AKhatun_WMF Access has been granted. I'll resolve this task; feel free to reopen if problems arise.

Mon, Apr 26, 4:00 PM · SRE, SRE-Access-Requests
akosiaris updated the task description for T280967: Requesting access to Wikimedia Analytics Data for Aisha Khatun.
Mon, Apr 26, 3:29 PM · SRE, SRE-Access-Requests
akosiaris added a comment to T279100: Have some dedicated jobrunners that aren't active videoscalers.

So, the crux of the issue is at those 2 functions below

def pool(self, pooled):
    """Pool a list of services"""
    try:
        for obj in pooled:
            obj.update({"pooled": "yes"})
        # Verification is not limited to the objects pooled above.
        self._verify_status(True)
        return True
    except (BackendError, PoolStatusError) as e:
        logger.error("Error pooling the servers: %s", e)
        return False

def _verify_status(self, want_pooled):
    if want_pooled:
        desired_status = "enabled/up/pooled"
    else:
        desired_status = "disabled/*/not pooled"
    # Checks every LVS URI, regardless of the state before the script started.
    for baseurl in self.lvs_uris:
        parsed = parse.urlparse(baseurl)
        url = "{baseurl}/{fqdn}".format(baseurl=baseurl, fqdn=self.fqdn)
        logger.debug("Now verifying %s", url)
        # This will raise a PoolStatusError
        self._fetch_retry(url, want_pooled, parsed, desired_status)

The pool() function gets a list of services to pool. In the specific case that jeena raised, it only pools the service that was already pooled before the script started (that is, the jobrunner service) and leaves the videoscaler service depooled (correctly). However, it then calls the _verify_status() function, which has no knowledge of the previous state and relies only on the LVS URIs passed on the command line.

Let me see if I can fix that easily.
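
One possible shape for such a fix, as a sketch rather than the actual patch: remember which pools were pooled before the restart and verify only the matching LVS URIs. The helper name `uris_to_verify`, the example hostnames, and the assumption that the last path component of each LVS URI names the pool are all hypothetical.

```python
from urllib import parse

def uris_to_verify(lvs_uris, previously_pooled):
    """Return only the LVS URIs whose pool was pooled before the restart.

    Assumes (hypothetically) that the last path component of each URI is
    the pool name, e.g. ".../pools/jobrunner".
    """
    wanted = set(previously_pooled)
    selected = []
    for baseurl in lvs_uris:
        path = parse.urlparse(baseurl).path.rstrip("/")
        pool_name = path.rsplit("/", 1)[-1]
        if pool_name in wanted:
            selected.append(baseurl)
    return selected

lvs_uris = [
    "http://lvs.example:9090/pools/jobrunner",
    "http://lvs.example:9090/pools/videoscaler",
]
# Only jobrunner was pooled before the script started, so only it is verified.
print(uris_to_verify(lvs_uris, ["jobrunner"]))
```

A verification loop like the one in _verify_status() could then iterate over this filtered list instead of every URI given on the command line.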

I think we should pool all servers back to all pools, and lower the weight of those we originally wanted in one pool, until we fix the restart script properly.

Mon, Apr 26, 2:35 PM · Patch-For-Review, WMF-JobQueue, Sustainability (Incident Followup), SRE, serviceops
akosiaris added a comment to T280967: Requesting access to Wikimedia Analytics Data for Aisha Khatun.

Account has been added to wmf ldap group, waiting for analytics-privatedata-users approval from Andrew before merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/682601

Mon, Apr 26, 11:20 AM · SRE, SRE-Access-Requests
akosiaris updated subscribers of T280967: Requesting access to Wikimedia Analytics Data for Aisha Khatun.

@Ottomata, since access to analytics-privatedata-users is asked for, we require your approval too on this task. Thanks!

Mon, Apr 26, 11:17 AM · SRE, SRE-Access-Requests
akosiaris updated the task description for T280967: Requesting access to Wikimedia Analytics Data for Aisha Khatun.
Mon, Apr 26, 11:17 AM · SRE, SRE-Access-Requests
akosiaris triaged T281107: ms-be1062 fell off the network, causing swift timeouts as High priority.
Mon, Apr 26, 9:59 AM · Wikimedia-Incident, SRE, SRE-swift-storage
akosiaris triaged T281095: Move paging for librenms from icinga to AM as Medium priority.
Mon, Apr 26, 9:59 AM · Patch-For-Review, SRE, User-fgiunchedi, netops, observability
akosiaris triaged T281090: Various debmonitor-client systemdtimer errors starting April 21st as Medium priority.
Mon, Apr 26, 9:59 AM · SRE, SRE-tools
akosiaris triaged T281055: mr1 port utilization alerts shouldn't mention hash page in their IRC logs as Medium priority.
Mon, Apr 26, 9:59 AM · SRE, netops
akosiaris triaged T281048: mwlog1001 is running out of free space on /srv/mw-log as Medium priority.
Mon, Apr 26, 9:58 AM · Performance-Team, MediaWiki-Revision-backend, MW-1.37-notes (1.37.0-wmf.3; 2021-04-27), observability, SRE
akosiaris triaged T281019: Please Upload large files to Commons as Medium priority.
Mon, Apr 26, 9:58 AM · SRE, Wikimedia-Site-requests, Internet-Archive
akosiaris triaged T281004: grafana-rw SSO redirect breaks template parameters due to double encoding as Medium priority.
Mon, Apr 26, 9:58 AM · CAS-SSO, SRE, observability
akosiaris triaged T280996: See if we can drop the extra lists.wikimedia.org in mailman3 URLs as Medium priority.
Mon, Apr 26, 9:58 AM · SRE, Wikimedia-Mailing-lists

Fri, Apr 23

akosiaris added a comment to T280718: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practive.

We could run one systemd timer on a random (canary) appserver and let it write the output to a local file.

Then we could use rsync::quickdatacopy and another timer on mwmaint* to pull that file over to where noc.wikimedia.org is hosted.

This would give an always current version of the file on actual MW appservers (not the same as mw* or deploy* necessarily, where deployers can already look themselves).

Fri, Apr 23, 6:03 PM · Patch-For-Review, serviceops-radar, SRE, Wikimedia-SVG-rendering
akosiaris added a comment to T280718: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practive.

I've gone through the various tasks (T280829, T79424, T210960 and T180923, let me know if I have missed others) and I'd like to add the following points.

Fri, Apr 23, 5:56 PM · Patch-For-Review, serviceops-radar, SRE, Wikimedia-SVG-rendering
akosiaris merged T280829: Expose live font list (fc-list) on a public webpage into T280718: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practive.
Fri, Apr 23, 5:33 PM · Patch-For-Review, serviceops-radar, SRE, Wikimedia-SVG-rendering
akosiaris merged task T280829: Expose live font list (fc-list) on a public webpage into T280718: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practive.
Fri, Apr 23, 5:32 PM · SRE
akosiaris added a comment to T280829: Expose live font list (fc-list) on a public webpage.

Cool, done so, thanks!

Fri, Apr 23, 5:32 PM · SRE
akosiaris added a comment to T280829: Expose live font list (fc-list) on a public webpage.

ACK!

It's kind of a duplicate of T280718 before that was renamed at least.

Fri, Apr 23, 5:24 PM · SRE
akosiaris added a comment to T238909: Proposal: simplify set up of a new load-balanced service on kubernetes.
  • Simulate node failures and record/evaluate recovery times
Fri, Apr 23, 5:02 PM · SRE, Prod-Kubernetes, Pybal, Traffic, serviceops
akosiaris updated subscribers of T280622: Determine safe concurrent puppet run batches via cumin.

@akosiaris thanks for digging into this a bit further, and apologies for not leaving more than a drive-by comment:

How long did the run take? My reading of the graphs says ~40m (from 12:00 to 12:38), is that correct?

This sounds about right to me, but unfortunately I don't have anything more precise still in my terminal.

There are 3 different CPU/load/network peaks (all correlating with each other). I wonder why. Any ideas?

Nothing very concrete, but I did a run yesterday using -b 120 (which is definitely too high). The run started at about 10:52:16 and I cancelled it at 11:33:37. In the graphs here you can see similar peaks towards the beginning; then, after the second peak, things look like they calm down. In fact, however, at this point all puppet runs started to fail. Looking in the logs you see the occasional 400, and icinga also alerted a couple of times on Apache being unavailable.

My working theory is that the servers get a big burst of requests, start compiling and then serving files to the first batch; then at some point the stress gets too much and Apache/Passenger starts to ask users to back off. It seems that the agents at this point implement some type of backoff retry algorithm to continue trying to fetch the files/catalogue/submit facts etc. This ties up the connections but not necessarily CPU, as they are not compiling catalogues. In icinga the errors look like either 0 resources received or unable to fetch File[/foo]. On the agent side one will notice agent runs starting to take much longer, up to seemingly hanging indefinitely.
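
To illustrate the working theory above: with a capped exponential backoff, an agent that keeps being told to back off stays connected and waiting for a long time without triggering any compilation work. This is a generic sketch, not Puppet's actual retry logic; the function name, base delay and cap are hypothetical.

```python
def backoff_delays(attempts, base=1.0, cap=60.0):
    """Return the wait in seconds before each retry, doubling up to a cap."""
    return [min(cap, base * 2 ** n) for n in range(attempts)]

delays = backoff_delays(8)
print(delays)       # wait before each of the 8 retries
print(sum(delays))  # total time a single agent spends waiting
```

Multiplied across a large batch of agents, waits like these would keep many connections occupied while CPU usage on the servers stays comparatively low, which matches the shape of the graphs described.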

Fri, Apr 23, 4:41 PM · SRE, Puppet
akosiaris triaged T280961: Degraded RAID on ms-be1019 as Low priority.
Fri, Apr 23, 12:20 PM · SRE, ops-eqiad
akosiaris triaged T280893: After lists have been migrated, https://lists.wikimedia.org/mailman/listinfo/<listname> should redirect to postorius as Medium priority.
Fri, Apr 23, 12:19 PM · SRE, Wikimedia-Mailing-lists
akosiaris moved T280881: New Service Request Toolhub from Inbox to Externally blocked on the Service-deployment-requests board.
Fri, Apr 23, 12:19 PM · Toolhub, Service-deployment-requests, Services, SRE
akosiaris triaged T280881: New Service Request Toolhub as Medium priority.
Fri, Apr 23, 12:19 PM · Toolhub, Service-deployment-requests, Services, SRE
akosiaris triaged T280887: Upgrade lists-next to bullseye mailman versions as Medium priority.
Fri, Apr 23, 12:19 PM · SRE, DBA, Wikimedia-Mailing-lists
akosiaris triaged T280967: Requesting access to Wikimedia Analytics Data for Aisha Khatun as Medium priority.
Fri, Apr 23, 12:13 PM · SRE, SRE-Access-Requests
akosiaris added a comment to T280967: Requesting access to Wikimedia Analytics Data for Aisha Khatun.

Hi Aisha. There is no such thing as All as far as groups go. Could you please clarify what exactly you are requesting access to?

Fri, Apr 23, 12:13 PM · SRE, SRE-Access-Requests
akosiaris triaged T280926: Various errors when trying to upload large files (Could not acquire lock, Service Temporarily Unavailable, 503 Backend fetch failed, 502 Next Hop Connection Failed) as Medium priority.
Fri, Apr 23, 12:05 PM · Structured-Data-Backlog (Current Work), SRE, Structured Data Engineering, Traffic, MediaWiki-Uploading, Commons, Wikimedia-production-error
akosiaris closed T280541: Requesting access to Wikimedia Analytics Data for Silvan Heintze as Resolved.

@Silvan_WMDE, Hi! The change expanding your access has been merged. Give it 30m or so to fully propagate and try it out. I'll resolve the task, but feel free to reopen if any issues arise. Thanks!

Fri, Apr 23, 11:58 AM · SRE, SRE-Access-Requests
akosiaris updated the task description for T280541: Requesting access to Wikimedia Analytics Data for Silvan Heintze.
Fri, Apr 23, 11:56 AM · SRE, SRE-Access-Requests
akosiaris updated the task description for T280541: Requesting access to Wikimedia Analytics Data for Silvan Heintze.
Fri, Apr 23, 11:53 AM · SRE, SRE-Access-Requests
akosiaris updated the task description for T280541: Requesting access to Wikimedia Analytics Data for Silvan Heintze.
Fri, Apr 23, 11:47 AM · SRE, SRE-Access-Requests
akosiaris added a comment to T238909: Proposal: simplify set up of a new load-balanced service on kubernetes.

Very cool!

  • Look into switching to "externalTrafficPolicy":"Local" in order to avoid the 2 layer load balancing

Curious to hear about that. AIUI this only works for type NodePort or LoadBalancer services. Not ClusterIP ones but I'm unsure if it works with a NodePort service talked to via the ClusterIP.
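
For reference, a minimal sketch of a Service manifest with that setting, built as a Python dict and dumped as JSON. The `externalTrafficPolicy` field applies to NodePort and LoadBalancer services, which the helper enforces. The helper name, service name, selector and ports are placeholders, not values from this task.

```python
import json

def service_manifest(name, svc_type, port, target_port):
    """Build a Service manifest that keeps traffic on the receiving node."""
    if svc_type not in ("NodePort", "LoadBalancer"):
        raise ValueError("externalTrafficPolicy requires NodePort or LoadBalancer")
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "type": svc_type,
            # Route only to pods on the node that received the traffic.
            "externalTrafficPolicy": "Local",
            "selector": {"app": name},
            "ports": [{"port": port, "targetPort": target_port}],
        },
    }

manifest = service_manifest("example", "NodePort", 80, 8080)
print(json.dumps(manifest, indent=2))
```

The sketch does not settle the open question of how a NodePort service behaves when reached via its ClusterIP; that would need testing on a real cluster.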

Fri, Apr 23, 7:44 AM · SRE, Prod-Kubernetes, Pybal, Traffic, serviceops
akosiaris committed rOHPU7d7f2bd08ee4: Enable per flow ECMP for kubernetes/kubestage (authored by akosiaris).
Enable per flow ECMP for kubernetes/kubestage
Fri, Apr 23, 7:28 AM

Wed, Apr 21

akosiaris triaged T280668: Degraded RAID on cloudvirt1018 as High priority.
Wed, Apr 21, 4:41 PM · SRE, ops-eqiad
akosiaris triaged T280782: thanos-fe2001 machine check exception and crash/stall as High priority.
Wed, Apr 21, 4:41 PM · SRE, ops-codfw
akosiaris awarded T280579: New service request: WDQS Flink based Streaming Updater a Love token.
Wed, Apr 21, 4:29 PM · Discovery-Search (Current work), wdwb-tech, SRE, Services, Wikidata, Service-deployment-requests, Wikidata-Query-Service
akosiaris triaged T228591: Document how to request installing additional fonts for SVG thumbnails and generated PDF files on Wikimedia servers as Low priority.

Specifically regarding https://noc.wikimedia.org/conf/fc-list, I've posted https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/681665. It looks like cruft we probably should not be keeping around (happy to hear otherwise though). So maybe I'll solve 1 of the bullet points of the task.

Wed, Apr 21, 4:26 PM · SRE, Wikimedia-General-or-Unknown, Documentation, Wikimedia-SVG-rendering
akosiaris closed T274749: Requesting access to stat boxes for mlitn as Resolved.

@matthiasmullie access has been granted. It will take ~30 minutes to fully propagate, but otherwise you are good on our end. I'll resolve this, but feel free to reopen if any issues arise. Thanks!

Wed, Apr 21, 4:21 PM · SRE, SRE-Access-Requests
akosiaris closed T280162: NDA for Superset Request from WMDE Employee Manuel as Resolved.

User added to wmde and nda ldap groups. @Manuel, I am resolving this task, feel free to reopen if any issues with your access arise.

Wed, Apr 21, 4:15 PM · SRE, LDAP-Access-Requests
akosiaris added a comment to T279244: CAS SSO for reedy.

Hi @Reedy, given the discussion in the task, do you reckon you still need racktables access? Or should we close this instead?

Wed, Apr 21, 4:06 PM · CAS-SSO, SRE, LDAP-Access-Requests
akosiaris added a comment to T280625: Kubernetes packages in Debian Bullseye.

@akosiaris to be clear, I didn't mean that ML would have used it to bypass Service ops, it was only to bring up the subject to get opinions, +1 to decline it after what has been said :)

Wed, Apr 21, 4:02 PM · SRE, Machine-Learning-Team, serviceops
akosiaris added a comment to T277629: Create new group for root access to snapshot*, dumpsdata* and labstore1006,7 with holger in it.

Any news on this?

Wed, Apr 21, 3:58 PM · SRE, SRE-Access-Requests, Dumps-Generation
akosiaris closed T279310: Need access to noc@wikimedia.org (associated with Analytics' MaxMind account) as Resolved.

@JLaytonWMF I am gonna tentatively resolve this task; it looks like the matter is out of SRE hands and MaxMind should be contacted directly at sales@maxmind.com. Feel free to reopen if we can somehow be of assistance.

Wed, Apr 21, 3:57 PM · SRE, SRE-Access-Requests
akosiaris added a comment to T280439: File:Chessboard480.svg WEBP thumbnail version not visible on safari when size is fixed at 208px.

wow, TIL. Thanks for that hint @ema.

Wed, Apr 21, 3:55 PM · Traffic, SRE, MediaWiki-General, Browser-Support-Apple-Safari
akosiaris triaged T280541: Requesting access to Wikimedia Analytics Data for Silvan Heintze as Medium priority.
Wed, Apr 21, 12:45 PM · SRE, SRE-Access-Requests
akosiaris added a comment to T280718: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practive.

3 commits

Wed, Apr 21, 12:41 PM · Patch-For-Review, serviceops-radar, SRE, Wikimedia-SVG-rendering
akosiaris renamed T280718: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practive from keep https://noc.wikimedia.org/conf/fc-list up-to-date to Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practive.
Wed, Apr 21, 12:39 PM · Patch-For-Review, serviceops-radar, SRE, Wikimedia-SVG-rendering
akosiaris triaged T280731: Implement static redirects from pipermail archives to hyperkitty archives as Medium priority.
Wed, Apr 21, 12:34 PM · Patch-For-Review, SRE, Wikimedia-Mailing-lists
akosiaris added a project to T280439: File:Chessboard480.svg WEBP thumbnail version not visible on safari when size is fixed at 208px: Traffic.

So a successful fetch per safari, of 100 bytes per Content-Length. Interestingly, my tests are almost identical HTTP headers wise. We share almost all headers, minus Date and a minor diff in x-cache (I got hit/213). And yet from what I understand the content is garbage/garbled or something similar. I am gonna add @ema and Traffic on this one. It brings back memories of T266373

Wed, Apr 21, 12:33 PM · Traffic, SRE, MediaWiki-General, Browser-Support-Apple-Safari
akosiaris triaged T280579: New service request: WDQS Flink based Streaming Updater as Medium priority.
Wed, Apr 21, 11:21 AM · Discovery-Search (Current work), wdwb-tech, SRE, Services, Wikidata, Service-deployment-requests, Wikidata-Query-Service
akosiaris closed T280625: Kubernetes packages in Debian Bullseye as Declined.

I am gonna tentatively decline this for now. While serviceops wouldn't block Machine-Learning-Team if they wanted to utilize bullseye debian packages, it's a bit premature, at least for our use cases.

Wed, Apr 21, 11:19 AM · SRE, Machine-Learning-Team, serviceops
akosiaris triaged T280623: Can't access thanos-fe1001.mgmt as Medium priority.
Wed, Apr 21, 11:17 AM · SRE, ops-eqiad
akosiaris triaged T280582: Shrink redis_sessions cluster as Medium priority.
Wed, Apr 21, 11:16 AM · Performance-Team (Radar), User-jijiki, SRE, serviceops
akosiaris triaged T280622: Determine safe concurrent puppet run batches via cumin as Low priority.

FYI I ran puppet fleet-wide today using a batch size of 40 and there was no issue. Puppet master load rose from ~1.5 to 4.0. You can see a small peak in grafana. From this we should be able to go a bit higher.

Wed, Apr 21, 11:15 AM · SRE, Puppet
akosiaris triaged T280527: payments1006.frack.eqiad.wmnet DRAC no console output as High priority.
Wed, Apr 21, 11:04 AM · fundraising-tech-ops, SRE, DC-Ops, ops-eqiad
akosiaris closed T280473: mail.wikimedia.org doesn't redirect to lists.wikimedia.org as Resolved.

From what I gather, we are on board with removing it, so resolving this in favor of tracking the work in T280472. Feel free to reopen.

Wed, Apr 21, 11:04 AM · SRE, Wikimedia-Mailing-lists
akosiaris triaged T280472: Figure out if we can remove legacy domain support for mailing lists as Medium priority.
Wed, Apr 21, 11:02 AM · Patch-For-Review, SRE, Wikimedia-Mailing-lists
akosiaris triaged T280439: File:Chessboard480.svg WEBP thumbnail version not visible on safari when size is fixed at 208px as Low priority.

Triaging as low until we can have an easy reproduction scenario.

Wed, Apr 21, 10:56 AM · Traffic, SRE, MediaWiki-General, Browser-Support-Apple-Safari
akosiaris added a comment to T280439: File:Chessboard480.svg WEBP thumbnail version not visible on safari when size is fixed at 208px.

Is this just safari on iOS and Mac? This works for me (at least on 1 try) on:

Wed, Apr 21, 10:55 AM · Traffic, SRE, MediaWiki-General, Browser-Support-Apple-Safari
akosiaris triaged T280432: Adding new font for CJK media display as Low priority.
Wed, Apr 21, 10:46 AM · SRE, Wikimedia-SVG-rendering
akosiaris changed the status of T280408: Create a mailing list for ptwikinews from Open to Stalled.

Stalling for a couple of weeks per above comments

Wed, Apr 21, 10:45 AM · User-Ladsgroup, SRE, Wikimedia-Mailing-lists
akosiaris triaged T280322: Upgrade mailing lists from mailman2 to 3 in batches as Medium priority.
Wed, Apr 21, 10:44 AM · User-Ladsgroup, SRE, Wikimedia-Mailing-lists
akosiaris updated subscribers of T238909: Proposal: simplify set up of a new load-balanced service on kubernetes.

And with the merge and deploy of the above we got:

Wed, Apr 21, 9:49 AM · SRE, Prod-Kubernetes, Pybal, Traffic, serviceops
akosiaris committed rOHPU9546198ad624: Add kubernetes service IP ranges to prefix list (authored by akosiaris).
Add kubernetes service IP ranges to prefix list
Wed, Apr 21, 9:16 AM

Tue, Apr 20

akosiaris added a comment to T279100: Have some dedicated jobrunners that aren't active videoscalers.

So, the crux of the issue is at those 2 functions below

Tue, Apr 20, 10:25 AM · Patch-For-Review, WMF-JobQueue, Sustainability (Incident Followup), SRE, serviceops
akosiaris reopened T279100: Have some dedicated jobrunners that aren't active videoscalers as "Open".

Reopening. The bug that @jeena reported in T279100#7000270 is reproducible. It seems that safe-service-restart won't take the previous state (pooled/depooled) of a resource into account when verifying that everything is pooled.

Tue, Apr 20, 10:21 AM · Patch-For-Review, WMF-JobQueue, Sustainability (Incident Followup), SRE, serviceops
akosiaris added a comment to T280625: Kubernetes packages in Debian Bullseye.

For the "main" set of clusters, we have devised a plan to adopt upstream binaries but without relying on their repos. The policy, as well as the reasoning for it (the various versioning interdependency requirements we are forced to honor), is documented at https://wikitech.wikimedia.org/wiki/Kubernetes/Kubernetes_Infrastructure_upgrade_policy#Using_existing_upstream_binaries and applies to kubernetes as well as calico or helm. The implementation is at https://gerrit.wikimedia.org/r/plugins/gitiles/operations/debs/kubernetes/+/refs/heads/future/debian/get-kubernetes-release.sh and it essentially fetches the .tar.gz of the binaries and stuffs them into packages fit for our infrastructure.

Tue, Apr 20, 8:49 AM · SRE, Machine-Learning-Team, serviceops

Mon, Apr 19

akosiaris claimed T253058: DRY kafka broker declaration in helmfiles.
Mon, Apr 19, 4:22 PM · serviceops, SRE, Patch-For-Review, Event-Platform, Analytics
akosiaris added a comment to T253058: DRY kafka broker declaration in helmfiles.

Adopting the new functionality in networkpolicy resources has indeed created some tech debt. It's a tech debt we created on purpose while devoting resources to finalize the migration away from the old way of maintaining those networkpolicies. Now that that's gone, I want to revisit it and deduplicate it as much as possible.

Mon, Apr 19, 4:07 PM · serviceops, SRE, Patch-For-Review, Event-Platform, Analytics
akosiaris added a comment to T278083: Define SLIs/SLOs for link recommendation service.

One thing that I forgot to point out. Given that the internal vs the external services have different audiences, it probably makes sense to also come up with different SLOs, as the requirements will be different.

Mon, Apr 19, 1:15 PM · Growth-Team (Current Sprint), Add-Link
akosiaris updated subscribers of T278083: Define SLIs/SLOs for link recommendation service.

SRE Service Ops will provide information and a walk through for this.

@akosiaris scheduling a meeting with the relevant stakeholders might be difficult; can we do this asynchronously?

Mon, Apr 19, 11:55 AM · Growth-Team (Current Sprint), Add-Link