Page MenuHomePhabricator

MediaModeration maintenance script scanFilesInScanTable.php indirectly calls $wgImageMagickConvertCommand
Closed, ResolvedPublic

Description

According to the comments in wmf-config/CommonSettings.php:

// Please note: neither command exists in the container, and this should
// never be called in production - still better to set it to something that could work on k8s.
$wgImageMagickConvertCommand = ClusterConfig::getInstance()->isK8s() ? '/usr/bin/convert' : '/usr/local/bin/mediawiki-firejail-convert';

However, the ThumbnailRender fails with a variety of errors, one of which is the job trying to call /usr/bin/convert Example log

This happened around 8k times in 1h, so we should revert ThumbnailRender to bare metal until we have fixed this.

After further investigation, the ThumbnailRender job was not implicated and was returned to mw-on-k8s. The call happens inside the jobs spawned by the MediaModeration maintenance script scanFilesInScanTable.php when using --use-jobqueue.

Related Objects

Event Timeline

Change 991377 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] changeprop-jobqueue: disable ThumbnailRender on k8s

https://gerrit.wikimedia.org/r/991377

Change 991377 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop-jobqueue: disable ThumbnailRender on k8s

https://gerrit.wikimedia.org/r/991377

Jdforrester-WMF raised the priority of this task from High to Unbreak Now!.Jan 17 2024, 5:18 PM

As we can see on a wider log view errors coincide with train deployments to group0 then group1

I see the errors still showing up in prod. By looking at Scap's code, I think redeploying the train to group1 should make this change go to prod.

@Clement_Goubert would that be enough or is there anything else required as far as you know?

After digging a bit more, I think a simpler sync-world with a few flags will be enough to deploy this Helm config change. I'll try it in a bit if there are no objections.

Mentioned in SAL (#wikimedia-operations) [2024-01-17T18:52:54Z] <jnuche@deploy2002> Started scap: deploying K8s config changes from T355243

Mentioned in SAL (#wikimedia-operations) [2024-01-17T18:54:37Z] <jnuche@deploy2002> Finished scap: deploying K8s config changes from T355243 (duration: 01m 42s)

Error rate seems unaffected after deploying the configuration change: https://logstash.wikimedia.org/goto/5f6d40ce5bbf6313e2fcec6ccc28ea51 :( Please advise.

I have a commitment soon and need to stop for the day. I've asked the backup conductor @jeena to follow up on this.

isn't a deployment of changeprop in kubernetes needed here? I don't think scap does this.

EDIT: I checked the logs and saw that the changeprop-jobqueue changes were applied

Hi @jeena , looking at the comments looks like this is not related to mw-on-k8s migration as I see @Clement_Goubert reverted to bare metal and the issue is still persisting. Please correct me if I am missing something.

Yes, I think that is the correct assessment (the revert to bare metal did not fix the error. But I'm also not sure what configuration changes were needed to make this change on the mediawiki side that needed a scap deploy), so we still need to figure out how to solve this issue.

Do you have a trace of how these calls occur? In theory I think this shouldn't be happening because thumbnailing requests should be directed to Thumbor and not MediaWiki. That routing usually happens in Swift.

What's supposed to happen is that the ThumbnailRender job makes an HTTP request to http://ms-fe.svc.codfw.wmnet/wikipedia/commons/thumb/4/4f/Ambroise_Thomas_[...] which is Swift routes to Thumbor after seeing if the file exists in storage already. Unless something is seriously misconfigured, it should make no difference where the job runs, it's not supposed to do any thumbnailing itself. (It can, but in Wikimedia production it's not configured to.)

Also there are ~100 errors per minute. ThumbnailRender tries to create four thumbnails per upload. There are usually 5-10 uploads per minute on Commons, I doubt the other wikis add that much. So this seems too frequent to come from the job. But it also isn't coming from normal browser requests, because then thumbnailing would be visibly broken and it isn't.

The error logs are weirdly spiky. Start around Tuesday 11:00, stop at 18:30, start again on Wednesday 11:00, stop again on Thursday 5:00. I think the closest match from SAL is T351400: Run the maintenance script scanning images in mediamoderation_scan on WMF wikis which uses

$thumbnail = $file->transform( [ 'width' => $this->thumbnailWidth ], File::RENDER_NOW );

I'm not that familiar with how Thumbor is integrated with the image handlers, but I think RENDER_NOW forces using local thumbnailing instead? cc @Dreamy_Jazz

EDIT: It looks like actually the metamoderation script actually spawns jobs, as interestingly all errors seem to come from the same reqId, 8e1438850af4ec4c4b82ebb6, and clearly it's not ThumbnailRenderer jobs.

Looking for this reqId in logstash, I found this other message:

MediaModerationFileScanner::scanSha1 returned this: <ul><li><p>Could not transform file Bauer_Bosch_VCC_836_-_drive_unit-49699.jpg: Error creating thumbnail: mkdir: cannot create directory '/sys/fs/cgroup/memory/mediawiki/job/1854857': No such file or directory

Which shows that this class tries to shell out locally to generate a thumbnail using a shellbox instance we haven't configured to act remotely.

This comment was removed by Tgr.

Clearly something has changed in the code so that now we're actually shelling out locally in the jobs you spawn.

Clearly something has changed in the code so that now we're actually shelling out locally in the jobs you spawn.

The jobqueue implementation patch is here, fwiw https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MediaModeration/+/989202

@Joe do you want us to stop the script for now, and switch to not using the job queue?

@Joe do you want us to stop the script for now, and switch to not using the job queue?

I mean, right now every step of your script is failing. We need to change that script to actually work with remote thumbnailing, or to make "local" thumbnailing use a remote shellbox before it can work with the jobqueue.

@Joe do you want us to stop the script for now, and switch to not using the job queue?

I mean, right now every step of your script is failing. We need to change that script to actually work with remote thumbnailing, or to make "local" thumbnailing use a remote shellbox before it can work with the jobqueue.

Not exactly. We are just sending the source image to PhotoDNA instead of a thumbnail, which tends to work out in many cases. (This also explains why there is a spike in https://grafana.wikimedia.org/d/STSXVVdSk/mediamoderation-photodna-stats?orgId=1&refresh=5m&viewPanel=22; originally we thought it was because of the higher rate of scanning, but now it's obvious that the thumbnailing part of the script is broken.)

The PhotoDNA API docs say "Alternatively, a publicly accessible URL of an image (gif, jpeg, png, bmp, or tiff) could be provided ... response time from the service may be adversely affected by external sources' download speeds." That seems like the minimal-effort short-term fix - just drop the RENDER_NOW, and put the thumb URL instead of the file source in the request.

The PhotoDNA API docs say "Alternatively, a publicly accessible URL of an image (gif, jpeg, png, bmp, or tiff) could be provided ... response time from the service may be adversely affected by external sources' download speeds." That seems like the minimal-effort short-term fix - just drop the RENDER_NOW, and put the thumb URL instead of the file source in the request.

We moved away from that because there were numerous issues with PhotoDNA receiving a 404 for the thumbnail, something for which we also had minimal ability to diagnose and debug, as the logs are on their side. We decided a while back in the mediamoderation 2.0 project to send the thumbnail contents (or source image as a fallback).

Thank you everyone for jumping on this.

It's not clear to me at this point if this is train-related after all. Should this ticket still be considered a train blocker?

Thank you everyone for jumping on this.

It's not clear to me at this point if this is train-related after all. Should this ticket still be considered a train blocker?

IMO, it is not train related. New code that shipped on the train allowed us to use --use-jobqueue with a maintenance script, which is why we're seeing these errors this week. We can disable that option.

Thank you everyone for jumping on this.

It's not clear to me at this point if this is train-related after all. Should this ticket still be considered a train blocker?

IMO, it is not train related. New code that shipped on the train allowed us to use --use-jobqueue with a maintenance script, which is why we're seeing these errors this week. We can disable that option.

@Joe @Tgr @jnuche what's the correct way to stop a script that another user has run?

I would like to stop the script invocation listed here https://phabricator.wikimedia.org/T351400#9465099

I've stopped the script running now and have removed T354432: 1.42.0-wmf.14 deployment blockers as a parent task. I will restart the script without the job queue option. However, this is likely to mean we are scanning at a slower rate and will need to move back to using the job queue sooner than later.

@kostajh @Dreamy_Jazz thank you, I can see the error rate going down. I'm going to proceed with the train.

The PhotoDNA API docs say "Alternatively, a publicly accessible URL of an image (gif, jpeg, png, bmp, or tiff) could be provided ... response time from the service may be adversely affected by external sources' download speeds." That seems like the minimal-effort short-term fix - just drop the RENDER_NOW, and put the thumb URL instead of the file source in the request.

While we don't attempt to generate thumbnails for deleted files we still send the original source file. In this case we cannot send a public URL as the image is deleted and so we need to send the file contents in the request.

Dreamy_Jazz lowered the priority of this task from Unbreak Now! to Needs Triage.EditedJan 18 2024, 2:54 PM

After I backported the patch in T355309: Don't use RENDER_NOW when generating thumbnails for PhotoDNA scans and restarted the script with the job queue method, I no longer see this error in logstash. The logstash entries being added by MediaModeration will be reduced to log and debug events as part of the patch in T355216: Keep a track of HTTP status codes from PhotoDNA and can be ignored.

As such the immediate problem is over and we should be able to unmark this as UBN. Leaving up to others to decide whether other work is needed for this, especially as it seems you cannot call File::transform with the RENDER_NOW flag while using a job.

what's the correct way to stop a script that another user has run?

Someone with root can kill it (send it a SIGINT). I don't think there's a better way, the same thing will happen when the user running the script presses Ctrl-C.

it seems you cannot call File::transform with the RENDER_NOW flag while using a job.

I don't think the job part is relevant, other for server selection; you can't use RENDER_NOW on k8s. The root issue is that RENDER_NOW breaks Thumbor integration. The same probably happens if you make a request to thumb.php directly. Without RENDER_NOW, transform() just returns a fake MediaTransformOutput with the thumbnail URL (which is deterministic given the image name and metadata so it can be calculated without doing the actual thumbnailing), and the thumbnailing happens on demand when someone requests the URL, by other layers of the infrastructure (specifically this Swift plugin) redirecting the request to Thumbor.

I can think of three ways of reenabling RENDER_NOW:

  • Do what ThumbnailRenderJob does and hit the URL. In fact you can probably just call ThumbnailRenderJob directly, so this is a good short-term fix that requires little code change. It might not work for deleted files and for files on private wikis (I guess the latter is not very relevant for media moderation), although the job hits Swift directly and not Varnish, so maybe that's all what's needed to avoid permission checks? I think it would work for old non-deleted versions of files, but not 100% sure about that either.
  • Use ShellBox for thumbnailing, as suggested in T355243#9467936. I don't think this is a good idea - we should not be using two parallel thumbnailing systems, which compete for the same thumbnail namespace (so when someone reports a problem with a thumbnail, you'd have to first figure out which thumbnailer was used, we'd have to deal with the bugs of both systems).
  • Fix transform() so it actually fetches the file from Thumbor. This is conceptually not very different from using ShellBox (you get the file contents from a web API) but simpler to do and also avoids duplication of thumbnailing services.

The root issue is that RENDER_NOW breaks Thumbor integration. The same probably happens if you make a request to thumb.php directly.

Although when I try this, there are a bunch of Thumbor-* headers on the response so it doesn't seem like it was created by MediaWiki shelling out. So I don't really understand what's going on - thumb.php also uses RENDER_NOW internally.

it seems you cannot call File::transform with the RENDER_NOW flag while using a job.

I don't think the job part is relevant, other for server selection; you can't use RENDER_NOW on k8s. The root issue is that RENDER_NOW breaks Thumbor integration. The same probably happens if you make a request to thumb.php directly. Without RENDER_NOW, transform() just returns a fake MediaTransformOutput with the thumbnail URL (which is deterministic given the image name and metadata so it can be calculated without doing the actual thumbnailing), and the thumbnailing happens on demand when someone requests the URL, by other layers of the infrastructure (specifically this Swift plugin) redirecting the request to Thumbor.

I can think of three ways of reenabling RENDER_NOW:

  • Do what ThumbnailRenderJob does and hit the URL. In fact you can probably just call ThumbnailRenderJob directly, so this is a good short-term fix that requires little code change. It might not work for deleted files and for files on private wikis (I guess the latter is not very relevant for media moderation), although the job hits Swift directly and not Varnish, so maybe that's all what's needed to avoid permission checks? I think it would work for old non-deleted versions of files, but not 100% sure about that either.
  • Use ShellBox for thumbnailing, as suggested in T355243#9467936. I don't think this is a good idea - we should not be using two parallel thumbnailing systems, which compete for the same thumbnail namespace (so when someone reports a problem with a thumbnail, you'd have to first figure out which thumbnailer was used, we'd have to deal with the bugs of both systems).
  • Fix transform() so it actually fetches the file from Thumbor. This is conceptually not very different from using ShellBox (you get the file contents from a web API) but simpler to do and also avoids duplication of thumbnailing services.

Thanks for the suggestions.

Although when I try this, there are a bunch of Thumbor-* headers on the response so it doesn't seem like it was created by MediaWiki shelling out. So I don't really understand what's going on - thumb.php also uses RENDER_NOW internally.

I think what happens (but not at all sure about it) is that the existence check in thumb.php makes Swift/Thumbor generate that file on demand. So I think you just need to copy that logic (we should probably encapsulate it in some File method for convenience).

Clement_Goubert renamed this task from ThumbnailRender job calls $wgImageMagickConvertCommand to MediaModeration maintenance script scanFilesInScanTable.php inderectly calls $wgImageMagickConvertCommand.Jan 22 2024, 11:48 AM
Clement_Goubert renamed this task from MediaModeration maintenance script scanFilesInScanTable.php inderectly calls $wgImageMagickConvertCommand to MediaModeration maintenance script scanFilesInScanTable.php indirectly calls $wgImageMagickConvertCommand.
Clement_Goubert triaged this task as Medium priority.
Clement_Goubert updated the task description. (Show Details)

I'm going to mark this as resolved as the relevant code no longer does this and in T356047 we should get the same behaviour without causing these errors.