mediawiki: migrate from image-suggestion to data-gateway
Closed, Resolved (Public)

Description

The new data-gateway service is up (T364921) and should be ready to take over for the image-suggestion service currently used by mediawiki (along with supporting new use cases, such as commons-impact-analytics).

We will need to:

  • Update all Data Gateway and image-suggestion documentation to reflect the new state of the world
  • Migrate the ImageSuggestions & GrowthExperiments extensions to use the data-gateway service listener
  • Remove the image-suggestion listener and references to it
  • Remove hard-coded references to the discovery service in the ImageSuggestions extension.json (r1179009)
  • Turn down the image-suggestion service - Note that this is a k8s ingress service, so the procedure differs from (and is simpler than) turning down an LVS service.

Event Timeline

The data-gateway listener is now available (though unused) in production MediaWiki at localhost:6038.

One question that came up while reviewing the current state of configuration:

While it's clear that we need an updated entry in the config arrays in ProductionServices.php and LabsServices.php (the latter null'd out), and to wire it into wgGEImageRecommendationServiceUrl, it's unclear to me what's happening in the ImageSuggestions extension itself.

Specifically, its configuration appears to statically reference image-suggestion.discovery.wmnet, and those configuration keys don't appear to be overridden where the extension is enabled in mediawiki-config.

Are these keys unused in practice, or is the extension actually side-stepping the service mesh?

/cc: @Cparle ^^^^

Change #1171702 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/mediawiki-config@master] image-suggestion: reconfigure for data-gateway listener

https://gerrit.wikimedia.org/r/1171702

Change #1171703 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/mediawiki-config@master] image-suggestion: cleanup unused refs to service listener

https://gerrit.wikimedia.org/r/1171703

@Cparle I put together a couple of strawperson changesets for mediawiki-config (one for the switch-over, one to clean up afterward); hopefully they're close enough to save you some time. The data-gateway service listener has been enabled/deployed, so this can be done at any time now. Of course, I'm happy to be around when this happens and help if I can, so let me know.

This is confusing. I'm not sure what the ImageSuggestions Extension actually does. Is it related to image suggestions in the Apps?
Because the suggestions one sees on web / Special:Homepage are coming from GrowthExperiments (GE), and the wikis where GE's image-recommendations are enabled are only partially overlapping with the wikis where the ImageSuggestions Extension is enabled.

Also, as far as I can tell from code-search, wgGEImageRecommendationServiceUrl is not actually used by the ImageSuggestions extension, only by GrowthExperiments. (Hence the GE in the config name.)

That URL is then used in GrowthExperiments in this method ($this->url is the injected config value from wgGEImageRecommendationServiceUrl):

ProductionImageRecommendationApiHandler
	private function getRequest( array $pathArgs = [] ): MWHttpRequest {
		$request = $this->httpRequestFactory->create(
			$this->url . '/' . implode( '/', array_map( 'rawurlencode', $pathArgs ) ),
			[
				'method' => 'GET',
				'originalRequest' => RequestContext::getMain()->getRequest(),
				'timeout' => $this->requestTimeout,
				'sslVerifyCert' => $this->shouldVerifySsl,
				'sslVerifyHost' => $this->shouldVerifySsl,
			],
			__METHOD__
		);
		$request->setHeader( 'Accept', 'application/json' );
		return $request;
	}
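
As a rough illustration of how that URL gets assembled, here is a minimal Python sketch of the same joining logic. It is not the GrowthExperiments code: build_request_url is a hypothetical helper, the base URL assumes the localhost:6038 listener mentioned earlier in this task, and the path segments are made up. urllib.parse.quote with safe="" is used here as an approximation of PHP's rawurlencode.

```python
from typing import Sequence
from urllib.parse import quote


def build_request_url(base_url: str, path_args: Sequence[str]) -> str:
    """Join path segments onto base_url, percent-encoding each one."""
    # safe="" makes "/" inside a segment get encoded too, so a segment
    # can never introduce an extra path component.
    encoded = [quote(arg, safe="") for arg in path_args]
    return base_url + "/" + "/".join(encoded)


# Hypothetical path segments, for illustration only.
print(build_request_url("http://localhost:6038", ["some path", "a/b"]))
```

The point of encoding per segment is that only the separators added by the join are real path delimiters, which is why a change of base URL alone should not alter the path format the service sees.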

This is confusing. I'm not sure what the ImageSuggestions Extension actually does. Is it related to image suggestions in the Apps?
Because the suggestions one sees on web / Special:Homepage are coming from GrowthExperiments (GE), and the wikis where GE's image-recommendations are enabled are only partially overlapping with the wikis where the ImageSuggestions Extension is enabled.

It is confusing, but AFAICT, ImageSuggestions sends weekly notifications to experienced users. So related, but disjoint? This seems apropos: https://w.wiki/Epfs

Also, as far as I can tell from code-search, wgGEImageRecommendationServiceUrl is not actually used by the ImageSuggestions extension, only by GrowthExperiments. (Hence the GE in the config name.)

Right, they don't share configuration. I think the referenced gerrit covers them both though. I've added you as a reviewer.

Is "statically reference image-suggestion.discovery.wmnet" something that is actually still working (or has ever worked) with the legacy image-suggestion service?

That extension is not even listed on https://www.mediawiki.org/wiki/Developers/Maintainers. I wonder in what form it was considered when dissolving the Structured Data team.

While we move forward with this, I think I'd like to figure out if that extension still does anything in the first place. (Its maintenance script is still being executed, still referencing the defunct Structured Data team: puppet config.) Is it actually sending any notifications to anyone?
Ideally, I'd like to know who owns it now (probably no one, but that should at least be explicitly acknowledged). Those conversations tend to take a lot of time, though, so that should not block this migration.

Right, they don't share configuration. I think the referenced gerrit covers them both though. I've added you as a reviewer.

The Gerrit commit looks fine as far as GrowthExperiments is concerned.

Is "statically reference image-suggestion.discovery.wmnet" something that is actually still working (or has ever worked) with the legacy image-suggestion service?

It is working, yes.

That extension is not even listed on https://www.mediawiki.org/wiki/Developers/Maintainers. I wonder in what form it was considered when dissolving the Structured Data team.

While we move forward with this, I think I'd like to figure out if that extension still does anything in the first place. (Its maintenance script is still being executed, still referencing the defunct Structured Data team: puppet config.) Is it actually sending any notifications to anyone?

I'd like to know this too. It's only enabled on a limited set of wikis. It does seem to send a handful of notifications, but if it's without an owner I think it's fair to ask whether it should continue to be deployed to production.

Ideally, I'd like to know who owns it now (probably no one, but that should at least be explicitly acknowledged). Those conversations tend to take a lot of time, though, so that should not block this migration.

Yes, we are way too comfortable leaving production software without ownership. I really wish we wouldn't do that, but if it's what's happening then we should at least be explicit and honest about it.

The Gerrit commit looks fine as far as GrowthExperiments is concerned.

Mind adding your +1 to the Gerrit?

Done.

On our end, we will mainly be watching two panels during the migration.
This panel shows the latency for the requests to the API at wgGEImageRecommendationServiceUrl:
https://grafana.wikimedia.org/d/vGq7hbnMz/special3a-homepage-and-suggested-edits?orgId=1&from=now-7d&to=now&timezone=utc&var-platform=$__all&var-UserImpactHandlerPingLimiter=$__all&var-impactrendermode=$__all&viewPanel=panel-45
If the new data-gateway is substantially faster/slower, then that should be visible here.

And this panel shows the time to process the suggestions returned from the API:
https://grafana.wikimedia.org/d/vGq7hbnMz/special3a-homepage-and-suggested-edits?orgId=1&from=now-7d&to=now&timezone=utc&var-platform=$__all&var-UserImpactHandlerPingLimiter=$__all&var-impactrendermode=$__all&viewPanel=panel-189
So if for some reason the new API does not return suggestions, that would be visible here because the time to process them would collapse.

This is great; thanks for sharing this (I'll be keeping an eye on them myself).

FWIW, my expectation is that you'll see no difference. It is quite literally the same service, forked, renamed, with a number of additional endpoints added (which have been live in production now for some time). IOW, the code path for your endpoints is unchanged from the existing service.

After some discussion out of band, I'd like to propose Tuesday 2025-08-12 at 17:00 UTC (MediaWiki infra window) to deploy @Eevans' mediawiki-config patch.

Deploying on a Tuesday is ideal, since the ImageSuggestions jobs run early Wednesday UTC, so we won't need to wait long to see whether we've clearly broken things on that end.

@Michael - Any concerns about that timeline from the GrowthExperiments side of things?

I think it is fine. I'll be off that day, but we'll see the results of that in the morning. I'm not aware of anything being planned around image suggestions, so this is not going to interrupt anything on our side either.

@Scott_French I don't see this in https://wikitech.wikimedia.org/wiki/Deployments ... are you gonna do the deployment or do you want someone else to do it?

@Cparle - Thanks for the reminder, I've just not put it in the calendar yet. That said, it now seems I have a conflict with tomorrow's UTC-late window, so we may have to get creative with timing.

Just to confirm, did you want to be present for the deployment, and / or were there specific checks to perform that you had in mind?

Our plan for ImageSuggestions was mainly to verify that Wednesday's maintenance jobs complete without error (and movement of traffic from the image-suggestion service to data-gateway).

I don't think I need to be present, and your plan sounds good

Change #1171702 merged by jenkins-bot:

[operations/mediawiki-config@master] image-suggestion: reconfigure for data-gateway listener

https://gerrit.wikimedia.org/r/1171702

Mentioned in SAL (#wikimedia-operations) [2025-08-12T17:08:11Z] <swfrench@deploy1003> Started scap sync-world: Backport for [[gerrit:1171702|image-suggestion: reconfigure for data-gateway listener (T368096)]]

Mentioned in SAL (#wikimedia-operations) [2025-08-12T17:10:13Z] <swfrench@deploy1003> swfrench, eevans: Backport for [[gerrit:1171702|image-suggestion: reconfigure for data-gateway listener (T368096)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-08-12T17:30:39Z] <swfrench@deploy1003> Finished scap sync-world: Backport for [[gerrit:1171702|image-suggestion: reconfigure for data-gateway listener (T368096)]] (duration: 22m 28s)

Alright, a bit more than 20m on from when the deployment completed, things are looking good. Traffic to image-suggestion has fallen to 0 rps, and traffic has picked up on data-gateway (see also the view from envoy on the mediawiki side). No notable changes in latency for fetching or processing suggestions on the GrowthExperiments side either.

Next step will be to verify that the ImageSuggestions jobs complete without error this evening. This will start at 00:00 UTC Wednesday.

Eevans updated the task description.

Alright, the first ImageSuggestions job (cawiki) seems to have completed without issue after a typical ~ 10m run duration. No errors reported in logstash on the ImageSuggestions channel other than the "No more articles with suggestions found" error typically emitted by the very last batch to execute.

The mediawiki-side envoy metrics look reasonable as well - e.g., no influx of non-2xx response codes that one might naively expect if we've somehow messed up the URL path format string.

In all, there are 3 5xx errors buried in there, all of which appear to be due to query timeouts - e.g., from pod/aqs-http-gateway-main-8567569b67-zcgmn:

{"@timestamp":"2025-08-13T00:05:48Z","message":"Operation timed out - received only 1 responses.","client":{"ip":"127.0.0.1","port":"46628"},"log":{"level":"ERROR"},"service":{"name":"data-gateway"},"trace":{"id":"db55fb89a3290ce8a1234400"},"ecs":{"version":"1.11.0"}}

Not quite sure how frequent those kinds of errors are expected to be in practice, but @Eevans might have some intuition.
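
To get a feel for how frequent these are, one option is to tally them per hour from exported log lines. The following is only a sketch: count_timeouts_by_hour is a hypothetical helper, it assumes every line has the same JSON shape as the sample above, and how the lines are exported from logstash is left out entirely.

```python
import json
from collections import Counter

# The sample line quoted above, copied verbatim.
SAMPLE = (
    '{"@timestamp":"2025-08-13T00:05:48Z",'
    '"message":"Operation timed out - received only 1 responses.",'
    '"client":{"ip":"127.0.0.1","port":"46628"},'
    '"log":{"level":"ERROR"},"service":{"name":"data-gateway"},'
    '"trace":{"id":"db55fb89a3290ce8a1234400"},"ecs":{"version":"1.11.0"}}'
)


def count_timeouts_by_hour(lines):
    """Count ERROR-level 'Operation timed out' entries per UTC hour."""
    counts = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        level = entry.get("log", {}).get("level")
        message = entry.get("message", "")
        if level == "ERROR" and message.startswith("Operation timed out"):
            # "2025-08-13T00:05:48Z" -> "2025-08-13T00"
            counts[entry["@timestamp"][:13]] += 1
    return counts


print(count_timeouts_by_hour([SAMPLE]))
```

Bucketing by hour is an arbitrary choice; it is coarse enough to line the counts up against the weekly job start times.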

Given the above, I think we're good to go to leave this as-is through the remainder of the runs.

@Cparle if you might be able to check that things look as you'd expect during your Wednesday, that would be greatly appreciated.

I wouldn't expect these errors, no. These come from the Cassandra coordinator node (the one the client is connected to), complaining that it timed out while waiting for enough responses to make quorum. And while there are never many, they do seem to recur, and at intervals that I assume correspond with the weekly runs?

image.png (867×1 px, 137 KB)

Average latency spikes correspond here as well:

image.png (644×1 px, 90 KB)

I wonder if there isn't something about the size of these responses that's making them problematic. I seem to recall we had issues with that in the past (importing more records than we'd planned for)
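
For readers less familiar with the quorum wording above, here is a small sketch of the arithmetic. The formula (a majority of replicas) is standard Cassandra behaviour for QUORUM reads; the replication factor of 3 is purely an assumption for illustration, since neither the keyspace's replication factor nor the consistency level used by data-gateway is stated in this task.

```python
def quorum(replication_factor: int) -> int:
    """Replica responses needed for a Cassandra QUORUM read (a majority)."""
    return replication_factor // 2 + 1


def read_times_out(required: int, received: int) -> bool:
    """The coordinator reports a read timeout when too few replicas answer in time."""
    return received < required


rf = 3  # assumed replication factor, not stated in this thread
print(quorum(rf), read_times_out(quorum(rf), received=1))
```

Under that assumption, "received only 1 responses" means one replica answered in time and the coordinator gave up waiting for a second.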

Edit:

It's worth noting that these errors did not seem to occur last week:

{F65747855}

That could still come down to an increase in response size (if they'd tipped over the point of creating timeouts in the intervening time), or it could be the result of some non-normative changes to the container image (Debian and/or Go version, driver, etc). It's not a lot of errors...

Summarizing some discussion out of band with @Cparle -

To evaluate how things are working, it may be useful to be able to see this log line that reports cumulative stats on the number of notifications / users for each processed batch. That should be possible if we set the ImageSuggestions channel log severity threshold to info in wmgMonologChannels.

That shouldn't be too noisy, since the periodic jobs do not set --verbose on SendNotificationsForUnillustratedWatchedTitles (which would emit a log for each notification sent). This would result in a couple hundred log lines per wiki processed, for each of 7 wikis (judging by the rate / number of ImageSuggestionsNotifications jobs).

We also have the --dry-run flag available for manually re-running a periodic job (e.g., the one for cawiki) without actually sending notifications.

Change #1178601 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/mediawiki-config@master] Reduce log level to 'info' on ImageSuggestions

https://gerrit.wikimedia.org/r/1178601

Change #1178611 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/deployment-charts@master] data-gateway: enable debug logging

https://gerrit.wikimedia.org/r/1178611

Change #1178601 merged by jenkins-bot:

[operations/mediawiki-config@master] Reduce log level to 'info' on ImageSuggestions

https://gerrit.wikimedia.org/r/1178601

Mentioned in SAL (#wikimedia-operations) [2025-08-13T20:17:57Z] <swfrench@deploy1003> Started scap sync-world: Backport for [[gerrit:1178601|Reduce log level to 'info' on ImageSuggestions (T368096)]]

Change #1178611 merged by jenkins-bot:

[operations/deployment-charts@master] data-gateway: enable debug logging

https://gerrit.wikimedia.org/r/1178611

Mentioned in SAL (#wikimedia-operations) [2025-08-13T20:20:13Z] <swfrench@deploy1003> swfrench: Backport for [[gerrit:1178601|Reduce log level to 'info' on ImageSuggestions (T368096)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-08-13T20:27:19Z] <swfrench@deploy1003> Finished scap sync-world: Backport for [[gerrit:1178601|Reduce log level to 'info' on ImageSuggestions (T368096)]] (duration: 09m 22s)

Mentioned in SAL (#wikimedia-operations) [2025-08-13T20:36:17Z] <swfrench-wmf> start manual equivalent of imagesuggestions-notifyunillustratedwatched-ca cronjob in --dry-run mode - T368096

The manual --dry-run mode SendNotificationsForUnillustratedWatchedTitles.php run (SAL) completed without issue, and we were able to get some useful data out of it.

I'll let @Eevans speak for the data-gateway and Cassandra bits, but for the logs produced by the ImageSuggestionsNotifications jobs, these look pretty reasonable, in that they show the script is making progress as expected despite the presence of a handful of 500s returned by data-gateway.

All logs from this run can be found here, with the final progress log being (direct link):

Finished job. In total have notified 71 users about 135 pages. Notifications not sent for -46 pages as they had no available users or the suggestions were excluded or didn't meet the confidence threshold.

What is curious is the negative value of $numMissing ... Not quite sure what to make of that, or whether it might be a side-effect of --dry-run, though it's hard to see how it could be given the structure of the loop just above in Notifier::run. In any case, I doubt this is a new phenomenon.


Edit: I should have looked further up in the file ... In Notifier::__construct, if the intention is to have 'numPages' be persistent across jobs via the job params (as would seem to be the case given the way $numMissing is calculated), but default to zero, then the left and right hand sides of + seem to be reversed:

$this->jobParams = [ 'numPages' => 0 ] + $jobParams;

So yeah, in effect, jobParams['numPages'] starts from a clean slate for every Notifier instantiated, whereas jobParams['notifiedUserIds'] does not.

Which is to say, the In total have notified 71 users about 135 pages. portion of the log line is accurate (both calculated from jobParams['notifiedUserIds']), whereas the reported missing count is not.
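
To make the operand-order point concrete, here is a small Python model of PHP's array + operator, where the left-hand value wins for any key present on both sides. This is a sketch only: php_array_union is a hypothetical stand-in for the PHP operator, and the incoming job params are made up.

```python
def php_array_union(left: dict, right: dict) -> dict:
    """Model PHP's array `+`: for keys present in both, the left value wins."""
    return {**right, **left}


# Made-up job params, as if carried over from a previous job.
incoming = {"numPages": 120, "notifiedUserIds": [1, 2, 3]}

# Order as quoted above: the default sits on the left, so it always wins
# and any carried-over numPages is discarded.
as_written = php_array_union({"numPages": 0}, incoming)

# Swapped order: the default only applies when the key is missing.
swapped = php_array_union(incoming, {"numPages": 0})

print(as_written["numPages"], swapped["numPages"])
```

In both orders notifiedUserIds survives, which matches the observation that the user and page totals are accurate while the missing count is not.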

Change #1178655 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/mediawiki-config@master] Remove remove unused image-suggestion from services

https://gerrit.wikimedia.org/r/1178655

Change #1178657 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/puppet@production] hieradata: disable and remove unused image-suggestion listener

https://gerrit.wikimedia.org/r/1178657

The 500s are indeed the result of query read timeouts at the coordinator nodes, and the queries in question all reliably time out even when run from a command shell:

Connected to Analytics Query Service Storage at 10.64.0.199:9042
[cqlsh 6.1.0 | Cassandra 4.1.8 | CQL spec 3.4.6 | Native protocol v5]
Use HELP for help.
cassandra@cqlsh> SELECT * FROM image_suggestions.suggestions WHERE wiki = 'cawiki' AND page_id = 105477;
ReadTimeout: Error from server: code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 0 responses." info={'consistency': 'ONE', 'required_responses': 1, 'received_responses': 0}
cassandra@cqlsh>

I am convinced this is unrelated to the migration, that the timing is a coincidence, and that we can proceed as planned. I will continue to follow up on these timeouts though, and open a separate ticket to do so.

Edit:

T401877: aqs: Cassandra read timeouts

Change #1178655 abandoned by Scott French:

[operations/mediawiki-config@master] Remove remove unused image-suggestion from services

Reason:

Duplicates Iec83a5efc06e42092c641e4b5ef694b9a777e17d (not needed)

https://gerrit.wikimedia.org/r/1178655

Change #1179009 had a related patch set uploaded (by Scott French; author: Scott French):

[mediawiki/extensions/ImageSuggestions@master] Remove default API endpoints referencing image-suggestion

https://gerrit.wikimedia.org/r/1179009

Change #1179009 merged by jenkins-bot:

[mediawiki/extensions/ImageSuggestions@master] Remove default API endpoints referencing image-suggestion

https://gerrit.wikimedia.org/r/1179009

Looking at our longer term KPI dashboard that records actual user interactions (clicks, saves, reverts), I can confirm that we, Growth, do not see a change there either. 👍

Thanks for confirming, @Michael!

Alright, I think we're largely ready to remove support for image-suggestion entirely, and then turn down the service itself.

One oddity that came to light when I was double-checking that nothing sends traffic to image-suggestion anymore:

Indeed, it seems we still see a wee bit of traffic at times, all of which appears to have originated at mwdebug1002 per the non-k8s envoy telemetry metrics.

The fact that those requests transit envoy means that this is likely GrowthExperiments traffic, not ImageSuggestions (which was previously using the bare discovery addresses directly).

Why would mwdebug hosts still be using image-suggestion?

They're in the process of being turned down (T397498), and with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1169673 (merged 4th August), are no longer being deployed to.

Why would mwdebug hosts still be used at all?

This I don't have a great answer for. They're no longer accessible via XWD, but there may be cases where folks are SSH-tunneling directly and are unaware that the hosts are no longer expected to work properly.

This seems like a more general communications problem, which I'll follow up on separately.

Change #1171703 merged by jenkins-bot:

[operations/mediawiki-config@master] image-suggestion: cleanup unused refs to service listener

https://gerrit.wikimedia.org/r/1171703

Mentioned in SAL (#wikimedia-operations) [2025-08-26T17:06:51Z] <swfrench@deploy1003> Started scap sync-world: Backport for [[gerrit:1171703|image-suggestion: cleanup unused refs to service listener (T368096)]]

Mentioned in SAL (#wikimedia-operations) [2025-08-26T17:12:53Z] <swfrench@deploy1003> eevans, swfrench: Backport for [[gerrit:1171703|image-suggestion: cleanup unused refs to service listener (T368096)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-08-26T17:19:06Z] <swfrench@deploy1003> Finished scap sync-world: Backport for [[gerrit:1171703|image-suggestion: cleanup unused refs to service listener (T368096)]] (duration: 12m 15s)

Change #1198575 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/puppet@production] service: move image-suggestion to service_setup

https://gerrit.wikimedia.org/r/1198575

Change #1198576 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/puppet@production] deployment_server: absent image-suggestion k8s creds config

https://gerrit.wikimedia.org/r/1198576

Change #1198577 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/puppet@production] deployment_server: remove absented image-suggestion k8s creds config

https://gerrit.wikimedia.org/r/1198577

Change #1198578 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/puppet@production] service: remove image-suggestion

https://gerrit.wikimedia.org/r/1198578

Change #1198579 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] image-suggestion: disable k8s ingress for turndown

https://gerrit.wikimedia.org/r/1198579

Change #1198580 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] image-suggestion: remove service configuration

https://gerrit.wikimedia.org/r/1198580

Change #1198579 abandoned by Scott French:

[operations/deployment-charts@master] image-suggestion: disable k8s ingress for turndown

Reason:

Not necessary - deletion of resources should be sufficient

https://gerrit.wikimedia.org/r/1198579

Change #1198584 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/dns@master] wmnet: remove image-suggestion k8s ingress CNAMEs

https://gerrit.wikimedia.org/r/1198584

@Scott_French do you expect to have this done by end of quarter, or do you need others' help on it?

@MLechvien-WMF - It looks like I have all of the patches ready, so it's just a question of rebasing and finding a couple of hours. Which is to say, I think it should be easy to (finally) get this done.

@Scott_French do we want to commit to this in next quarter or shall we put it in Backlog until we find the capacity to deploy it?

@MLechvien-WMF - Thanks for the reminder. I just rebased all of the patches and it looks like everything is still ready to go. I may be able to get all / most of this done tomorrow. If not, I think it makes sense to pick up in Q4 rather than backlog, since the service has no purpose anymore.

Change #1178657 merged by Scott French:

[operations/puppet@production] hieradata: disable and remove unused image-suggestion listener

https://gerrit.wikimedia.org/r/1178657

Mentioned in SAL (#wikimedia-operations) [2026-04-01T17:12:35Z] <swfrench@deploy1003> Started scap sync-world: helmfile-only deployment to remove unused image-suggestion listener - T368096

Mentioned in SAL (#wikimedia-operations) [2026-04-01T17:18:06Z] <swfrench@deploy1003> Finished scap sync-world: helmfile-only deployment to remove unused image-suggestion listener - T368096 (duration: 07m 25s)

Change #1198575 merged by Scott French:

[operations/puppet@production] service: move image-suggestion to service_setup

https://gerrit.wikimedia.org/r/1198575

Alright, the mesh listener has been removed, and the blackbox probes are no longer active.

I've also gone ahead and destroyed the staging resources to confirm that ingress seems to do the right thing without additional intervention. I'll move ahead with the production resources once I'm ready to move ahead with the remaining sequence.

Mentioned in SAL (#wikimedia-operations) [2026-04-01T22:48:11Z] <swfrench-wmf> removed unused image-suggestion service in codfw - T368096

Mentioned in SAL (#wikimedia-operations) [2026-04-01T22:58:20Z] <swfrench-wmf> removed unused image-suggestion service in eqiad - T368096

After further thought (and getting distracted by other issues), I decided to move ahead with destroying the production codfw and eqiad resources now rather than doing so together with merging / applying https://gerrit.wikimedia.org/r/1198576, https://gerrit.wikimedia.org/r/1198577, and https://gerrit.wikimedia.org/r/1198580 in quick succession.

Basically, even though it is extremely unlikely that anything might possibly (secretly) care about image-suggestion at this point, moving ahead with those patches makes resurrecting the service quite a bit slower.

In any case, I'll keep an eye on things for now and then move ahead with the rest first thing tomorrow.

Revert: In the meantime, should anything go awry such that image-suggestion needs to come back, simply run helmfile -e $CLUSTER apply in /srv/deployment-charts/helmfile.d/services/image-suggestion for each of staging, codfw, and eqiad.
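
Spelled out per cluster, the revert above amounts to three commands. The sketch below only builds and prints them; it does not run anything. revert_commands is a hypothetical helper, and the commands only make sense while the image-suggestion service configuration still exists in deployment-charts.

```python
import shlex

SERVICE_DIR = "/srv/deployment-charts/helmfile.d/services/image-suggestion"
CLUSTERS = ["staging", "codfw", "eqiad"]


def revert_commands(clusters=CLUSTERS, service_dir=SERVICE_DIR):
    """Return one shell command line per cluster; nothing is executed."""
    return [
        f"cd {shlex.quote(service_dir)} && helmfile -e {shlex.quote(c)} apply"
        for c in clusters
    ]


for command in revert_commands():
    print(command)
```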

Change #1198576 merged by Scott French:

[operations/puppet@production] deployment_server: absent image-suggestion k8s creds config

https://gerrit.wikimedia.org/r/1198576

Change #1198577 merged by Scott French:

[operations/puppet@production] deployment_server: remove absented image-suggestion k8s creds config

https://gerrit.wikimedia.org/r/1198577

Change #1198580 merged by jenkins-bot:

[operations/deployment-charts@master] image-suggestion: remove service configuration

https://gerrit.wikimedia.org/r/1198580

Change #1198584 merged by Scott French:

[operations/dns@master] wmnet: remove image-suggestion k8s ingress CNAMEs

https://gerrit.wikimedia.org/r/1198584

Change #1267137 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] fixtures: clean up reference to image-suggestion

https://gerrit.wikimedia.org/r/1267137

Change #1267137 merged by jenkins-bot:

[operations/deployment-charts@master] fixtures: clean up reference to image-suggestion

https://gerrit.wikimedia.org/r/1267137

Change #1198578 merged by Scott French:

[operations/puppet@production] service: remove image-suggestion

https://gerrit.wikimedia.org/r/1198578

This is now (finally!) done. Thanks, all!