Joe (Giuseppe Lavagetto)
Spy

User Details

User Since
Oct 3 2014, 5:57 AM (477 w, 6 d)
Availability
Available
LDAP User
Giuseppe Lavagetto
MediaWiki User
GLavagetto (WMF)

Recent Activity

Today

Joe added a comment to T352003: Create a dedicated image for Debian package builds.

These images should be in the base images repository at operations/docker-images/production-images that gets built via docker-pkg as part of our pipeline.

Thu, Nov 30, 10:41 AM · collaboration-services

Yesterday

Joe added a comment to T352245: Migrate etcd::tlsproxy Nginx certs and etcd itself to PKI.

When we make the change, it will require a restart of etcd on the nodes.

Wed, Nov 29, 10:00 AM · serviceops
Joe closed T352156: PHP Warning: geoip_country_code_by_name(): Required database not available at /usr/share/GeoIP/GeoIP.dat. as Resolved.

I created the subtasks, assigned the one about WikimediaEvents to @kostajh.

Wed, Nov 29, 9:09 AM · MW-1.42-notes (1.42.0-wmf.9; 2023-12-12), LandingCheck, Patch-For-Review, MediaWiki-extensions-WikimediaEvents, Release-Engineering-Team, MW-on-K8s, serviceops, Wikimedia-production-error
Joe placed T352247: Remove calls to GeoIP 1 in Extension:LandingCheck up for grabs.
Wed, Nov 29, 9:08 AM · LandingCheck, Wikimedia-production-error
Joe created T352248: Remove calls to GeoIP 1 in Extension:WikimediaEvents.
Wed, Nov 29, 9:07 AM · MediaWiki-extensions-WikimediaEvents, Wikimedia-production-error
Joe created T352247: Remove calls to GeoIP 1 in Extension:LandingCheck.
Wed, Nov 29, 9:01 AM · LandingCheck, Wikimedia-production-error
Joe reopened T352156: PHP Warning: geoip_country_code_by_name(): Required database not available at /usr/share/GeoIP/GeoIP.dat. as "Open".

The task is not resolved until both uses of the old geoip library are removed. We can either leave this UBN! open (and assign it to the maintainers of the libraries) or create UBN! subtasks for those.

Wed, Nov 29, 8:35 AM · MW-1.42-notes (1.42.0-wmf.9; 2023-12-12), LandingCheck, Patch-For-Review, MediaWiki-extensions-WikimediaEvents, Release-Engineering-Team, MW-on-K8s, serviceops, Wikimedia-production-error
Joe added a comment to T352156: PHP Warning: geoip_country_code_by_name(): Required database not available at /usr/share/GeoIP/GeoIP.dat..

https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/978031 would remove the code from WikimediaEvents, which is probably OK to get the train moving forward?

That was an infrastructure issue (the legacy GeoIP data was no longer available) rather than a code issue. https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/978031 should be reverted to restore GeoIP coding in production for WikimediaEvents.

Wed, Nov 29, 8:32 AM · MW-1.42-notes (1.42.0-wmf.9; 2023-12-12), LandingCheck, Patch-For-Review, MediaWiki-extensions-WikimediaEvents, Release-Engineering-Team, MW-on-K8s, serviceops, Wikimedia-production-error

Tue, Nov 28

Joe added a comment to T352156: PHP Warning: geoip_country_code_by_name(): Required database not available at /usr/share/GeoIP/GeoIP.dat..

The problem is that the new kubernetes nodes don't have a copy of the .dat files... because those files were discontinued in April 2022 and we should've converted any use of such files a long time ago, per T269475, using the geoip2/geoip2 library instead.

The problem was introduced in https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/WikimediaEvents/+/baaf1182661d6430991499f4a54e9da6fb4f061b which happened well after we should have discontinued using geoipv1

We can re-provide those files, but I think we should instead fix the code, here and anywhere else we're still using the old geoip module

I would go with re-providing the geoip files, since their removal has led to code being broken. I guess the remaining usage of the legacy geoip_country_code_by_name() could have been investigated and migrated before the removal, potentially with a guard in CI to prevent it from being reinstated :)

Tue, Nov 28, 2:40 PM · MW-1.42-notes (1.42.0-wmf.9; 2023-12-12), LandingCheck, Patch-For-Review, MediaWiki-extensions-WikimediaEvents, Release-Engineering-Team, MW-on-K8s, serviceops, Wikimedia-production-error
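
The "guard in CI" idea from the comment above could look roughly like the following. This is a minimal sketch, not an existing Wikimedia CI check: the function name `find_legacy_geoip_calls` and the regular expression are illustrative assumptions, and only the PHP function name `geoip_country_code_by_name` comes from the task itself.

```python
import re
from pathlib import Path

# Legacy GeoIP v1 call named in the task title. The exact pattern is an
# assumption made for this sketch.
LEGACY_CALL = re.compile(r"\bgeoip_country_code_by_name\s*\(")


def find_legacy_geoip_calls(root):
    """Return (path, line_number) pairs for each legacy GeoIP call under root."""
    hits = []
    for path in sorted(Path(root).rglob("*.php")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if LEGACY_CALL.search(line):
                hits.append((str(path), number))
    return hits
```

A CI job could call this on the checked-out repository and fail the build whenever the returned list is non-empty.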
Joe added a comment to T352156: PHP Warning: geoip_country_code_by_name(): Required database not available at /usr/share/GeoIP/GeoIP.dat..

codesearch also shows that more or less the same code is in https://gerrit.wikimedia.org/g/mediawiki/extensions/LandingCheck/+/9977828e2e4c4ce1ef88384807192ed03fae4a37/includes/SpecialLandingCheck.php#108, which should also be amended to use geoip2, I guess.

Tue, Nov 28, 11:25 AM · MW-1.42-notes (1.42.0-wmf.9; 2023-12-12), LandingCheck, Patch-For-Review, MediaWiki-extensions-WikimediaEvents, Release-Engineering-Team, MW-on-K8s, serviceops, Wikimedia-production-error
Joe added a comment to T352156: PHP Warning: geoip_country_code_by_name(): Required database not available at /usr/share/GeoIP/GeoIP.dat..

The problem is that the new kubernetes nodes don't have a copy of the .dat files... because those files were discontinued in April 2022 and we should've converted any use of such files a long time ago, per T269475, using the geoip2/geoip2 library instead.

Tue, Nov 28, 11:18 AM · MW-1.42-notes (1.42.0-wmf.9; 2023-12-12), LandingCheck, Patch-For-Review, MediaWiki-extensions-WikimediaEvents, Release-Engineering-Team, MW-on-K8s, serviceops, Wikimedia-production-error

Mon, Nov 27

Joe closed T345970: Deploy StatsD exporter for Kubernetes, a subtask of T343023: Deploy StatsD Exporter to production, as Resolved.
Mon, Nov 27, 10:16 AM · SRE Observability (FY2023/2024-Q2), User-herron, Observability-Metrics
Joe closed T345970: Deploy StatsD exporter for Kubernetes as Resolved.

Apologies, I didn't realize there was this additional task re: statsd on k8s, and I've attached the patches to T343025

Mon, Nov 27, 10:16 AM · SRE Observability (FY2023/2024-Q2), MW-on-K8s, serviceops, User-herron, Observability-Metrics

Wed, Nov 22

Joe edited P53668 (An Untitled Masterwork).
Wed, Nov 22, 8:48 AM

Tue, Nov 21

Joe closed T285806: Document communication expectations around planning a DC switchover as Resolved.

Given we now have switchovers at regular intervals, we can resolve this task. There is no need to do a lot of communication around it.

Tue, Nov 21, 8:28 AM · serviceops, SRE, Datacenter-Switchover

Mon, Nov 20

Joe added a comment to T350846: Migrate mobileapps to k8s.

As you might have noticed from the patches here, we've pivoted: traffic splitting to the canaries via kube-proxy converges over hours, not seconds, and seconds is what we'll need when we start raising our percentages.

Mon, Nov 20, 2:09 PM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s

Thu, Nov 16

Joe added a comment to T347366: Follow-up on wmf-config "ClusterConfig::isTest" method.

I would suggest actually "fixing" mediawiki-config, as changing the cluster in puppet is not as straightforward as we expected.

Thu, Nov 16, 7:04 AM · Patch-For-Review, MediaWiki-Engineering-Group-onboarding, Wikimedia-Site-requests, MediaWiki-Platform-Team

Mon, Nov 13

Jdforrester-WMF awarded T351074: Move servers from the appserver/api cluster to kubernetes a Yellow Medal token.
Mon, Nov 13, 3:06 PM · Patch-For-Review, serviceops, MW-on-K8s
Joe added a comment to T350846: Migrate mobileapps to k8s.

I decided we should move about 10% of the mobileapps traffic at a time; that means about 300 rps, which I think we should be able to serve moving over about 2-3 api servers to become k8s nodes, or an additional 15 pods

Mon, Nov 13, 12:18 PM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
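
The figures in the comment above imply a total traffic volume and a per-pod load that can be worked out directly. The sketch below only restates that arithmetic: the function name `plan_step` is hypothetical, and the derived values are implications of the quoted numbers (10%, about 300 rps, 15 pods), not measurements.

```python
def plan_step(step_percent, step_rps, extra_pods):
    """Derive the figures implied by one migration step."""
    return {
        # Total mobileapps traffic implied by "10% is about 300 rps".
        "total_rps": step_rps * 100 / step_percent,
        # Load each additional pod would have to absorb.
        "rps_per_pod": step_rps / extra_pods,
        # Number of equally sized steps needed to move all traffic.
        "steps_to_full": 100 // step_percent,
    }


plan = plan_step(step_percent=10, step_rps=300, extra_pods=15)
print(plan)
```

With the numbers from the comment this works out to roughly 3000 rps in total, about 20 rps per additional pod, and ten steps to move everything.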
Joe triaged T350846: Migrate mobileapps to k8s as High priority.
Mon, Nov 13, 10:38 AM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Joe created T351074: Move servers from the appserver/api cluster to kubernetes.
Mon, Nov 13, 10:37 AM · Patch-For-Review, serviceops, MW-on-K8s

Thu, Nov 9

Joe added a comment to T350846: Migrate mobileapps to k8s.

As is clear from the patches, I chose to take the sage advice of @JMeybohm and go down the path of least resistance :)

Thu, Nov 9, 3:23 PM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Joe added a comment to T346657: Requests originating from zhwiki wikifeeds caused parsoid outage.

And good news: most requests to the endpoint now take 50-100ms to get a response, instead of 5-10 seconds. This is much more sustainable.

Thu, Nov 9, 11:46 AM · MW-1.42-notes (1.42.0-wmf.3; 2023-10-31), Maintenance-Worktype, serviceops, Content-Transform-Team-WIP, Chinese-Sites, RESTBase, Parsoid
Joe added a comment to T346657: Requests originating from zhwiki wikifeeds caused parsoid outage.

Given the patch is now live with the latest train, I've disabled the rule for now.

Thu, Nov 9, 11:44 AM · MW-1.42-notes (1.42.0-wmf.3; 2023-10-31), Maintenance-Worktype, serviceops, Content-Transform-Team-WIP, Chinese-Sites, RESTBase, Parsoid
Joe closed T350645: feed/featured endpoint is showing error on zhwiki as Resolved.

Tentatively resolving for now, but this might go back to banned if the underlying problems aren't solved with the latest patches.

Thu, Nov 9, 11:40 AM · Wikipedia-Android-App-Backlog (Android Release - FY2023-24), Chinese-Sites, RESTBase, Wikipedia-iOS-App-Backlog
Joe added a comment to T350645: feed/featured endpoint is showing error on zhwiki.

@hnowlan is this related to a REST Gateway change?

Seems unlikely, the REST gateway doesn't autonomously issue 401s generally. Is this a hangover from T346657?

Thu, Nov 9, 11:35 AM · Wikipedia-Android-App-Backlog (Android Release - FY2023-24), Chinese-Sites, RESTBase, Wikipedia-iOS-App-Backlog
Joe added a comment to T350846: Migrate mobileapps to k8s.

Couldn't we just add another mobileapps release (like a canary) that connects to mw-api-int and scale that up slowly while scaling the existing one down? That would not require any envoy config patching

Thu, Nov 9, 11:07 AM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Joe claimed T350846: Migrate mobileapps to k8s.
Thu, Nov 9, 8:23 AM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Joe updated the task description for T350846: Migrate mobileapps to k8s.
Thu, Nov 9, 8:23 AM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Joe created T350846: Migrate mobileapps to k8s.
Thu, Nov 9, 8:23 AM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s

Wed, Nov 8

Joe closed T344478: Fix how we keep docker-pkg based images up to date as Resolved.
Wed, Nov 8, 3:03 PM · Release Pipeline (Blubber), docker-pkg, serviceops
Joe closed T350770: Incorrect php redirect in some virtualhosts in mw on k8s as Resolved.
Wed, Nov 8, 2:29 PM · serviceops, MW-on-K8s
Joe triaged T350770: Incorrect php redirect in some virtualhosts in mw on k8s as High priority.
Wed, Nov 8, 9:18 AM · serviceops, MW-on-K8s
Joe created T350770: Incorrect php redirect in some virtualhosts in mw on k8s.
Wed, Nov 8, 9:18 AM · serviceops, MW-on-K8s

Tue, Nov 7

Joe added a comment to T223413: Broken (empty) cross-wiki notification when using $wgLocalHTTPProxy (e.g. on Kubernetes).

For me, I am in USA, and the bug for MediaWiki.org notifications trying to display enwiki notifications went away about a month ago. Was seeing it consistently before, and haven't seen it since. Hope this info helps.

Tue, Nov 7, 5:59 PM · serviceops, MW-on-K8s, Growth-Team-Filtering, Growth-Team, Notifications
Joe updated the task description for T350366: Multiple images fail to build from sources.
Tue, Nov 7, 4:09 PM · serviceops
Ladsgroup awarded T349796: Move MediaWiki jobs to mw-on-k8s a Love token.
Tue, Nov 7, 1:07 PM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Joe added a comment to T223413: Broken (empty) cross-wiki notification when using $wgLocalHTTPProxy (e.g. on Kubernetes).
Tue, Nov 7, 12:01 PM · serviceops, MW-on-K8s, Growth-Team-Filtering, Growth-Team, Notifications

Mon, Nov 6

Joe added a comment to T348122: Move 25% of mediawiki external requests to mw on k8s.

The Kubernetes work so far has caused problems with cross-wiki Echo notifications (see T223413, T342201). Please help resolve this before further rollouts. Thanks!

Mon, Nov 6, 3:16 PM · Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Joe added a comment to T223413: Broken (empty) cross-wiki notification when using $wgLocalHTTPProxy (e.g. on Kubernetes).

@matmarex do you have a way to verify if the bug still presents itself? It's slightly hard for me as I mostly edit mediawiki.org or wikitech.

Mon, Nov 6, 3:14 PM · serviceops, MW-on-K8s, Growth-Team-Filtering, Growth-Team, Notifications
Joe closed T342201: MediaWiki\Extension\Notifications\Api\ApiEchoUnreadNotificationPages::getUnreadNotificationPagesFromForeign: Unexpected API response from {wiki} as Resolved.
Mon, Nov 6, 3:13 PM · Growth-Team, serviceops, SRE, MediaWiki-Platform-Team, MW-1.41-notes (1.41.0-wmf.20; 2023-08-01), MediaWiki-extensions-CentralAuth, MW-on-K8s, Notifications, Wikimedia-production-error
Joe added a comment to T223413: Broken (empty) cross-wiki notification when using $wgLocalHTTPProxy (e.g. on Kubernetes).

I suspect the fix I made for T342201 actually might have solved this issue as well. Not sure how to verify it though.

Mon, Nov 6, 2:36 PM · serviceops, MW-on-K8s, Growth-Team-Filtering, Growth-Team, Notifications
Joe added a comment to T342201: MediaWiki\Extension\Notifications\Api\ApiEchoUnreadNotificationPages::getUnreadNotificationPagesFromForeign: Unexpected API response from {wiki}.

Since my deployment of this change, MediaWiki\Extension\Notifications\Api\ApiEchoUnreadNotificationPages::getUnreadNotificationPagesFromForeign: Unexpected API response from {wiki} messages seem to have disappeared. I'll wait until this evening and then resolve the task.

Mon, Nov 6, 2:35 PM · Growth-Team, serviceops, SRE, MediaWiki-Platform-Team, MW-1.41-notes (1.41.0-wmf.20; 2023-08-01), MediaWiki-extensions-CentralAuth, MW-on-K8s, Notifications, Wikimedia-production-error
Joe added a comment to T342201: MediaWiki\Extension\Notifications\Api\ApiEchoUnreadNotificationPages::getUnreadNotificationPagesFromForeign: Unexpected API response from {wiki}.

When mcrouter-primary-dc is selected, we set the routing prefix to /$wmgMasterDatacenter/mw/, so writes are consistently directed there. I just found a bug in the configuration of mcrouter on k8s that should solve the issue once I've fixed it.

Mon, Nov 6, 1:01 PM · Growth-Team, serviceops, SRE, MediaWiki-Platform-Team, MW-1.41-notes (1.41.0-wmf.20; 2023-08-01), MediaWiki-extensions-CentralAuth, MW-on-K8s, Notifications, Wikimedia-production-error
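
To illustrate the routing prefix described in the comment above, here is a minimal sketch of how a key would be addressed to the primary datacenter. The helper names are hypothetical, the example key is made up, and `eqiad` is used only as a sample value for `$wmgMasterDatacenter`. The actual mcrouter configuration on k8s is not shown in the task.

```python
def routing_prefix(master_datacenter, pool="mw"):
    """Build a prefix of the form /<datacenter>/<pool>/ as quoted in the comment."""
    return f"/{master_datacenter}/{pool}/"


def prefixed_key(master_datacenter, key):
    """Prepend the routing prefix so the request targets the primary DC."""
    return routing_prefix(master_datacenter) + key


print(prefixed_key("eqiad", "example:key"))
```

Because every write carries the same prefix, it ends up in the same datacenter regardless of where the request originated, which is the consistency property the comment relies on.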
Joe added a comment to T223413: Broken (empty) cross-wiki notification when using $wgLocalHTTPProxy (e.g. on Kubernetes).

I can't imagine why calling the primary datacenter would be a problem in this case, unless there is a logical race condition or, worse, we completely rely on data being in memcached and/or any other datastore that's not replicated.

Mon, Nov 6, 11:09 AM · serviceops, MW-on-K8s, Growth-Team-Filtering, Growth-Team, Notifications
Joe created T350565: Switch conftool to use the version 3 etcd datastore.
Mon, Nov 6, 8:28 AM · conftool, Infrastructure-Foundations, Data-Persistence, Traffic, serviceops
Joe changed the visibility for F41456021: image.png.
Mon, Nov 6, 8:22 AM

Thu, Nov 2

Joe closed T350423: Allow access to WMF-NDA tasks on phabricator to Shaun Spalding as Resolved.

I already added Shaun and I'm just filing the task retroactively to keep track of the action properly.

Thu, Nov 2, 5:15 PM · WMF-NDA-Requests
Joe created T350423: Allow access to WMF-NDA tasks on phabricator to Shaun Spalding.
Thu, Nov 2, 5:14 PM · WMF-NDA-Requests
Joe added a member for WMF-NDA: SSpalding-WMF.
Thu, Nov 2, 5:07 PM
Joe updated the task description for T350366: Multiple images fail to build from sources.
Thu, Nov 2, 3:53 PM · serviceops
Joe updated the task description for T350366: Multiple images fail to build from sources.
Thu, Nov 2, 3:10 PM · serviceops
Joe updated subscribers of T350366: Multiple images fail to build from sources.

The error in the loki build is due to its dependency on golang-1.13, which was retired years ago. @colewhite do you think we can just remove the loki image instead?

Thu, Nov 2, 12:00 PM · serviceops
Joe updated the task description for T350366: Multiple images fail to build from sources.
Thu, Nov 2, 11:53 AM · serviceops
Joe claimed T350366: Multiple images fail to build from sources.
Thu, Nov 2, 11:15 AM · serviceops
Joe triaged T350366: Multiple images fail to build from sources as High priority.
Thu, Nov 2, 11:15 AM · serviceops
Joe created T350366: Multiple images fail to build from sources.
Thu, Nov 2, 11:14 AM · serviceops
Joe added a comment to T349376: EtcdConfig using stale data: lost lock in /srv/mediawiki/php-1.42.0-wmf.1/includes/config/EtcdConfig.php on line 218.

@Krinkle the instrumentation added doesn't distinguish between failing to get a lock because it's already taken by another thread and just a complete failure. That would help us understand if it's actually an issue or not.

Thu, Nov 2, 9:20 AM · serviceops, MediaWiki-Engineering
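
The distinction asked for in the comment above (lock already held by another thread versus outright failure) can be sketched as a three-way outcome. This is an illustration in Python under stated assumptions, not the EtcdConfig or APCu code: `LockOutcome` and `try_lock` are hypothetical names, and a `threading.Lock` stands in for the real lock backend.

```python
import enum
import threading


class LockOutcome(enum.Enum):
    ACQUIRED = "acquired"
    CONTENDED = "contended"  # another holder already has the lock
    FAILED = "failed"        # the lock backend itself raised an error


def try_lock(lock):
    """Attempt a non-blocking acquire and report which case occurred."""
    try:
        got_it = lock.acquire(blocking=False)
    except Exception:
        return LockOutcome.FAILED
    return LockOutcome.ACQUIRED if got_it else LockOutcome.CONTENDED


lock = threading.Lock()
first = try_lock(lock)   # lock is free at this point
second = try_lock(lock)  # lock is still held from the first call
```

Logging or counting the two non-acquired outcomes separately would show whether the "lost lock" messages reflect ordinary contention or a real failure.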
Joe closed T340935: Some apache access logs are invalid json as Resolved.
Thu, Nov 2, 7:29 AM · Observability-Logging, serviceops, MW-on-K8s
Joe added a comment to T340935: Some apache access logs are invalid json.

The change should now be live, I'm tentatively re-closing this task as I can only find truncated messages that are unparsed now, not any due to bad encoding.

Thu, Nov 2, 7:29 AM · Observability-Logging, serviceops, MW-on-K8s

Oct 31 2023

Joe committed rLPRI55d69ddd09eb: docker::builder: strings must be strings in yaml (authored by Joe).
Oct 31 2023, 2:57 PM
Joe committed rLPRI02dbb8ac0259: Add fake ssh private key for docker::builder (authored by Joe).
Oct 31 2023, 2:49 PM
Joe triaged T350111: PKI system is unable to serve new certificates to debmonitor / other systems, causing puppet failures across the fleet. as Unbreak Now! priority.
Oct 31 2023, 8:20 AM · CFSSL-PKI, Infrastructure-Foundations
Joe created T350111: PKI system is unable to serve new certificates to debmonitor / other systems, causing puppet failures across the fleet..
Oct 31 2023, 8:19 AM · CFSSL-PKI, Infrastructure-Foundations

Oct 30 2023

Joe added a comment to T349918: Realm.pp loads before site.pp.

@Joe @akosiaris I wonder if either of you remembers whether this was intentional or just not considered?

Oct 30 2023, 9:34 AM · Puppet-Core, Infrastructure-Foundations

Oct 26 2023

Joe added a comment to T349823: [Event Platform] Gracefully handle pod termination in eventgate Helm chart.

The problem can also be that we have one component in front of the service (envoyproxy) that gets terminated immediately, while terminating the application on the backend takes more time, so connections are truncated immediately. We just added a pause before terminating envoy in mediawiki for example to overcome this problem.

Oct 26 2023, 3:06 PM · Data Engineering and Event Platform Team (Sprint 4), Data-Engineering, Event-Platform, serviceops
Joe added a comment to T344478: Fix how we keep docker-pkg based images up to date.

For now it would be enough for us to just get a gerrit account that we can use to:

  • submit and merge a change per week that adds a new changelog entry to all images in a repository as part of a weekly cron
  • build the production images after the change is merged
Oct 26 2023, 2:05 PM · Release Pipeline (Blubber), docker-pkg, serviceops
Joe added a comment to T344478: Fix how we keep docker-pkg based images up to date.

If your desire is having deterministic builds, it would be enough to NOT remove the apt archives from the base layer image, and then only run apt-get install at all other layers (unless you've added a new component), but I think this is a pretty specific need of CI images, to be honest. OS updates aren't usually problematic for anything else.

Oct 26 2023, 1:59 PM · Release Pipeline (Blubber), docker-pkg, serviceops
Joe triaged T349796: Move MediaWiki jobs to mw-on-k8s as High priority.
Oct 26 2023, 7:53 AM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Joe created T349796: Move MediaWiki jobs to mw-on-k8s.
Oct 26 2023, 7:49 AM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Joe closed T278220: Define the size of a pod for mediawiki in terms of resource usage as Resolved.

I would say this has been resolved for a long time?

Oct 26 2023, 7:43 AM · serviceops, MW-on-K8s
Joe closed T278220: Define the size of a pod for mediawiki in terms of resource usage, a subtask of T277711: Memcached, mcrouter in MediaWiki on Kubernetes, as Resolved.
Oct 26 2023, 7:43 AM · serviceops, SRE
Joe moved T345970: Deploy StatsD exporter for Kubernetes from Backlog to Ready on the MW-on-K8s board.
Oct 26 2023, 7:42 AM · SRE Observability (FY2023/2024-Q2), MW-on-K8s, serviceops, User-herron, Observability-Metrics

Oct 25 2023

lmata awarded T343025: Identify path forward for k8s deployment of prometheus-statsd-exporter a Party Time token.
Oct 25 2023, 1:13 PM · SRE Observability (FY2023/2024-Q2), Patch-For-Review, serviceops, Observability-Metrics
Joe lowered the priority of T349671: Cannot upload on Commons or even here from Unbreak Now! to Medium.

Looking at the new file stream it looks like uploads work in general, so this is not an UBN! bug as far as SREs are concerned. I'd say that as far as we're concerned there seems to be no real generalized malfunction.

Oct 25 2023, 6:42 AM · Traffic, SRE
Joe closed T343025: Identify path forward for k8s deployment of prometheus-statsd-exporter as Resolved.

@colewhite prometheus statsd exporter can now be installed in mw on k8s with a configuration switch. Resolving this task for now.

Oct 25 2023, 6:33 AM · SRE Observability (FY2023/2024-Q2), Patch-For-Review, serviceops, Observability-Metrics
Joe closed T343025: Identify path forward for k8s deployment of prometheus-statsd-exporter, a subtask of T343020: Converting MediaWiki Metrics to StatsLib, as Resolved.
Oct 25 2023, 6:33 AM · SRE Observability (FY2023/2024-Q2), Observability-Metrics

Oct 23 2023

Joe added a comment to T344428: refreshUserImpactJob logs mysterious fatal errors.

Sorry for the silence, I was first at a conference then in bed sick (and I'm still not in a great health condition).

Oct 23 2023, 8:16 AM · Growth-Team (Sprint 1 (Growth Team)), serviceops, SRE, Performance Issue, GrowthExperiments-Homepage, GrowthExperiments-ImpactModule
Joe triaged T349376: EtcdConfig using stale data: lost lock in /srv/mediawiki/php-1.42.0-wmf.1/includes/config/EtcdConfig.php on line 218 as Medium priority.

Setting the priority to medium as we do clearly read correctly from etcd in k8s, else we would've noticed all sorts of other problems like db connections failing

Oct 23 2023, 7:38 AM · serviceops, MediaWiki-Engineering
Joe added a comment to T349376: EtcdConfig using stale data: lost lock in /srv/mediawiki/php-1.42.0-wmf.1/includes/config/EtcdConfig.php on line 218.

EtcdConfig ultimately uses APCUBagOfStuff, see https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/refs/heads/master/wmf-config/etcd.php#29, so the lock is taken on APCU.

Oct 23 2023, 7:33 AM · serviceops, MediaWiki-Engineering

Oct 4 2023

Ladsgroup awarded T346422: Move 10% of mediawiki external requests to mw on k8s a Love token.
Oct 4 2023, 10:20 AM · serviceops, Kubernetes, Prod-Kubernetes

Sep 28 2023

Joe added a comment to T347493: Serve Wikidata traffic via Kubernetes.

Well, it turns out the issue was simpler: we even had a TODO in the code:

Sep 28 2023, 7:43 AM · [DEPRECATED] wdwb-tech, Wikidata, SRE, Traffic, serviceops, MW-on-K8s
Joe added a comment to T347493: Serve Wikidata traffic via Kubernetes.

I tried restarting ATS on a backend, cp1081, then made requests for wikidata's special:random to trafficserver directly: still all going to appservers on bare metal.

Sep 28 2023, 7:32 AM · [DEPRECATED] wdwb-tech, Wikidata, SRE, Traffic, serviceops, MW-on-K8s
Joe triaged T347544: Separate deployment for wikifunctions.org as High priority.
Sep 28 2023, 6:34 AM · Abstract Wikipedia team, SRE, Traffic, Wikifunctions, serviceops
Joe created T347544: Separate deployment for wikifunctions.org.
Sep 28 2023, 6:34 AM · Abstract Wikipedia team, SRE, Traffic, Wikifunctions, serviceops
Joe added a comment to T347493: Serve Wikidata traffic via Kubernetes.

Interestingly, I do get correct results for m.wikidata.org, but somehow not for www.wikidata.org (also, please grep for mw-web as we've repooled eqiad in the meantime).

Sep 28 2023, 6:15 AM · [DEPRECATED] wdwb-tech, Wikidata, SRE, Traffic, serviceops, MW-on-K8s
Joe added a comment to T347493: Serve Wikidata traffic via Kubernetes.

@Jdforrester-WMF no, this task is actually about that patch not having the effect we expected.

Sep 28 2023, 6:13 AM · [DEPRECATED] wdwb-tech, Wikidata, SRE, Traffic, serviceops, MW-on-K8s

Sep 27 2023

Joe committed rOSCT4f3655b85721: Release 2.3.2 (authored by Joe).
Sep 27 2023, 11:29 AM
Joe committed rOSCTfeffc82029a5: Add X-Known-Client support (authored by Joe).
Sep 27 2023, 11:29 AM

Sep 26 2023

Joe updated the task description for T333120: Migrate internal traffic to k8s.
Sep 26 2023, 1:14 PM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Joe closed T346447: Migrate wikifeeds to mw-api-int, a subtask of T333120: Migrate internal traffic to k8s, as Resolved.
Sep 26 2023, 1:14 PM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Joe closed T346447: Migrate wikifeeds to mw-api-int as Resolved.
Sep 26 2023, 1:14 PM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Joe updated the task description for T333120: Migrate internal traffic to k8s.
Sep 26 2023, 10:36 AM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Joe closed T346448: Migrate all eventgate installations to mw-api-int as Resolved.
Sep 26 2023, 10:36 AM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Joe closed T346448: Migrate all eventgate installations to mw-api-int, a subtask of T333120: Migrate internal traffic to k8s, as Resolved.
Sep 26 2023, 10:35 AM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s

Sep 25 2023

Joe added a comment to T343025: Identify path forward for k8s deployment of prometheus-statsd-exporter.

When you have your code in core, and we have merged the above patches, we can start testing your code specifically on k8s, first on the debug cluster, then everywhere.

Sep 25 2023, 10:54 AM · SRE Observability (FY2023/2024-Q2), Patch-For-Review, serviceops, Observability-Metrics

Sep 22 2023

Joe added a comment to T344428: refreshUserImpactJob logs mysterious fatal errors.

@Urbanecm_WMF as I said on IRC, there are two main differences when running in the jobqueue:

Sep 22 2023, 11:30 AM · Growth-Team (Sprint 1 (Growth Team)), serviceops, SRE, Performance Issue, GrowthExperiments-Homepage, GrowthExperiments-ImpactModule

Sep 21 2023

Joe changed the status of T343025: Identify path forward for k8s deployment of prometheus-statsd-exporter, a subtask of T343020: Converting MediaWiki Metrics to StatsLib, from Open to In Progress.
Sep 21 2023, 4:04 PM · SRE Observability (FY2023/2024-Q2), Observability-Metrics
Joe changed the status of T343025: Identify path forward for k8s deployment of prometheus-statsd-exporter from Open to In Progress.
Sep 21 2023, 4:04 PM · SRE Observability (FY2023/2024-Q2), Patch-For-Review, serviceops, Observability-Metrics
Joe added a comment to T346690: mcrouter daemonset on mw-on-k8s.

I pressed submit before finishing my comment:

Sep 21 2023, 2:37 PM · MediaWiki-Platform-Team, Patch-For-Review, serviceops, MW-on-K8s