Page MenuHomePhabricator

Latest image thumbnails aren't replaced correctly after image reupload
Closed, ResolvedPublicBUG REPORT

Description

Hello. It's a couple of hours there is a bug with file uploading. Maybe something went wrong on database switch, or it's a problem of this week deployment, don't know.
If you reupload a file, the new thumb is not created. The old version shown on file page, including in the versions table. It also continues to appear in articles. There is no way to use a new version. There is a way to see a new version, in the [[Media:]] namespace.
The problem happens on Commons and on local wikis. The usual solutions, as clear cache, purge or null edit, do not help. Reading the Commons reuploading log shows it's not just my problem.

  • Open some existing file, already used on wikis, either on Commons (I've opened this one), or local (I've opened this one and more)
  • Download it to your computer.
  • Edit it (I changed the Печатники osi on the first, and added a huge word "Draft" on the second. If you don't speak russion, use the second one, this word can be found easily).
  • Upload the new version.
  • Clear the cache, purge, make a null edit.
  • Click on the file thumb at the top of the file page.
  • Behold the new version without any problems.
  • Go back to the file page.
  • Behold the same thumb.

Expected: You should see the new version.

Got: You can see the old version without any changes.

Other information (browser name/version, screenshots, etc.):
Observed on Samsung Internet logged in, checked on Firefox logged out.

You can try the files I mentioned to check this right now. Thank you.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

I can reproduce: https://test.wikipedia.org/wiki/File:Testing_uploads.png (the first thumbnail doesn't update for me at the moment of writing this comment)

hashar triaged this task as Unbreak Now! priority.EditedMar 2 2023, 9:44 AM
hashar subscribed.

Looks like the images can still be replaced by new version but the thumb being shown is stalled. Sounds like a cache issue of some sort?

When looking at https://test.wikipedia.org/wiki/File:Testing_uploads.png , the main image shows some entirely orangeish image but in the file history. Here is the page screenshot:

{F36887962}

That orange page is expected, it what got uploaded. But in the file history the most recent thumbnail is wrong ;)

jcrespo renamed this task from Images can't be edited any more to Images thumbnails aren't served correctly after image reupload.Mar 2 2023, 9:46 AM
jcrespo renamed this task from Images thumbnails aren't served correctly after image reupload to Image thumbnails aren't served correctly after image reupload.
Joe subscribed.

This is an UBN issue. Adding traffic as this is either an issue with jobs not being enqueued to regenerate the thumbnail, or the cdn cache not being purged.

To clarify why I don't think the issue is with swift or thumbor, here is one thumbnail with the incorrect image:

https://upload.wikimedia.org/wikipedia/ru/thumb/1/17/Moscow_metro_map_ru_sb_future_draft.svg/461px-Moscow_metro_map_ru_sb_future_draft.svg.png (no DRAFT)

but if we request it at 1 px larger

https://upload.wikimedia.org/wikipedia/ru/thumb/1/17/Moscow_metro_map_ru_sb_future_draft.svg/462px-Moscow_metro_map_ru_sb_future_draft.svg.png

this is correctly generated.

Things we know up to now:

  • Originals get correctly replaced in swift
  • New thumbnails (so thumbs for new images, or new sizes for existing images) get correctly generated
  • Old thumbnails are not replaced in swift
  • Substituting an image does correctly spawn jobs to replace thumbnails and those are correctly executed

It seems that the ThumbnailRender job fails to replace thumbnails in swift. We need I think help from someone with experience with Filebackend to debug this further.

Further investigating:

  • @jcrespo re-uploaded the test image he was using, and I could see jobs spawned and completed successfully for regenerating thumbnails at 1024,1280,640 and 320 px
  • No request to thumbor in codfw to regenerate any of those thumbnails at that time; just a request for 120px, which is what is shown in the page.

So the problem is clearly that something, either the job or swift, are not cleaning the existing thumbnails correctly.

From swift proxy logs:
I see three PUTs at the relevant times (around 10:19) to

	  /v1/AUTH_mw/wikipedia-test-local-public/archive/8/8c/20230302101948%2521Testing_uploads.png
	  /v1/AUTH_mw/wikipedia-test-local-public/8/8c/Testing_uploads.png
	  /v1/AUTH_mw/wikipedia-test-local-thumb/archive/8/8c/20230302101948%2521Testing_uploads.png/120px-20230302101948%2521Testing_uploads.png

all of which return 201

There was only one DELETE this morning,

09:49:54 - DELETE
	  /v1/AUTH_mw/wikipedia-test-local-thumb/archive/8/8c/20230215222703%2521Testing_uploads.png/120px-20230215222703%2521Testing_uploads.png
	  HTTP/1.0 204

Which I think @jcrespo was expecting.

@aaron @Krinkle we're trying to find devs that may help with this.

After some analysis done by SRE (see discussion in irc security channel), it's possible the problem originates in the ThumbnailRender, would you know who to flag here?

FYI. Somebody told me she could see the new thumb on the top of the file page yesterday, after she couldn't, a couple of hours after I saw the problem, and when I still can't. Looks like it is shown differently for different users.

FYI. Somebody told me she could see the new thumb on the top of the file page yesterday, after she couldn't, a couple of hours after I saw the problem, and when I still can't. Looks like it is shown differently for different users.

Yes, the issue is that thumbnails don't get correctly regenerated. New thumbnails (e.g. if requested for new sizes) seem to be working fine. However, it is not a cache issue, the bad thumb is on swift.

So far I am seeing the issue starting approximately at 21:30-22:00 yesterday March 1, but that has a large error bar (also if we have into account the delay between a possible deploy and it taking into effect). This is the earliest occurrence I found: https://commons.wikimedia.org/wiki/File:BSicon_num(R)6r.svg

And one more. I can see now the right version in the articles, and it wasn't like this 9 hours ago.

jcrespo renamed this task from Image thumbnails aren't served correctly after image reupload to Latest image thumbnails aren't replaced correctly after image reupload.Mar 2 2023, 11:39 AM

Every time someone re-uploads an image, some jobs are triggered to build the thumbnail on "default sizes". the sizes are: [ 320, 640, 800, 1024, 1280, 1920 ]. That job is clearly broken. If you take a look at https://test.wikipedia.org/wiki/File:Wikitech-2021-blue-large-icon-ww_(4th_copy).svg. All thumbnails in the sizes mentioned are wrong and everything else is correct. It can be either the job completely failing to execute (or shortening out) or it's replacing the wrong file in the swift. Debugging further.

@Ladsgroup that matches the deployment of https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/891962 which we were suspecting being the culprit already. It seems relevant indeed. I propose we revert that patch.

Change 893652 had a related patch set uploaded (by Ladsgroup; author: Ladsgroup):

[operations/mediawiki-config@master] Revert "filebackend: Replace stringified class names with ::class"

https://gerrit.wikimedia.org/r/893652

@Ladsgroup that matches the deployment of https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/891962 which we were suspecting being the culprit already. It seems relevant indeed. I propose we revert that patch.

Actually: it doesn't. That was one hour earlier.

This is special because of this:

	/**
	 * Whether to retry the job.
	 * @return bool
	 */
	public function allowRetries() {
		// ThumbnailRenderJob is a warmup for the thumbnails cache,
		// so loosing it is not a problem. Most times the job fails
		// for non-renderable or missing images which will not be fixed
		// by a retry, but will create additional load on the renderer.
		return false;
	}

My theory is that the job fails for whatever reason and the old thumbnails doesn't get updated. I'm planning to run it in debug server to confirm.

Maybe I'm doing it wrong but the job in test setup returns true while actually not updating the thumbnail (it's quite fast, way too fast for a re-render of thumbnail)

ladsgroup@mwdebug2002:~$ mwscript eval.php --wiki=testwiki --profiler=text
> $job = new ThumbnailRenderJob(Title::newFromText('File:Wikitech-2021-blue-large-icon-ww (4th copy).svg'),[ 'transformParams' => ['width' => 640 ] ] );

> $res = $job->run();

> var_dump( $res );
bool(true)

But now I can play with the code and make it emit warnings.

The thumbnail URL it hits to rerender is the user-facing, cached one:

https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-deploy-1-7.0.0-1-2023.03.02?id=xtRSooYBwEtI0jFYRUOV

//upload.wikimedia.org/wikipedia/test/thumb/1/1f/Wikitech-2021-blue-large-icon-ww_%284th_copy%29.svg/640px-Wikitech-2021-blue-large-icon-ww_%284th_copy%29.svg.png [Called from ThumbnailRenderJob::hitThumbUrl in /srv/mediawiki/php-1.40.0-wmf.25/includes/jobqueue/jobs/ThumbnailRenderJob.php at line 92]

Which obviously get the old image, so I know the reason it's broken but I don't understand how this worked in the first place, If I can figure out the difference, then it should be easy to fix.

I'm sure of the root cause, the purging of old thumbnails during reupload short circuits (most likely because FileBackend is set to read only in codfw and didn't get fixed during switchover, I'm guessing, need to double check)

See https://performance.wikimedia.org/xhgui/run/symbol?id=64009fc3b8a581c635362c3b&symbol=FileBackend%3A%3AdoQuickOperations and it never gets to call doQuickOperationsInternal

It is shortening out in:

		if ( empty( $opts['bypassReadOnly'] ) && $this->isReadOnly() ) {
			return $this->newStatus( 'backend-fail-readonly', $this->name, $this->readOnly );
		}

Let me double check and make the patch.

Change 893652 abandoned by Jforrester:

[operations/mediawiki-config@master] Revert "filebackend: Replace stringified class names with ::class"

Reason:

No longer thought to be at fault.

https://gerrit.wikimedia.org/r/893652

ha

ladsgroup@mwdebug2002:~$ mwscript eval.php --wiki=testwiki --d 3

*******************************************************************************
NOTE: Do not run maintenance scripts directly, use maintenance/run.php instead!
      Running scripts directly has been deprecated in MediaWiki 1.40.
      It may not work for some (or any) scripts in the future.
*******************************************************************************

[debug] [memcached] MemcachedPeclBagOStuff::initializeClient: initializing new client instance.
[debug] [memcached] MainWANObjectCache using store MemcachedPeclBagOStuff
[debug] [rdbms] Wikimedia\Rdbms\LoadBalancer::reallyOpenConnection: opened new connection for 0/testwiki
[debug] [rdbms] Wikimedia\Rdbms\LBFactory::getChronologyProtector: request info {
    "IPAddress": "127.0.0.1",
    "UserAgent": false,
    "ChronologyProtection": false,
    "ChronologyPositionIndex": 0,
    "ChronologyClientId": false
}
[debug] [rdbms] Wikimedia\Rdbms\LoadBalancer::loadSessionPrimaryPos: executed chronology callback.
[debug] [rdbms] Wikimedia\Rdbms\LoadBalancer::pickReaderIndex: connecting to db2149...
[debug] [rdbms] Wikimedia\Rdbms\LoadBalancer::reallyOpenConnection: opened new connection for 5/
[debug] [rdbms] Wikimedia\Rdbms\LoadBalancer::getReaderIndex: using server db2149 for group ''
[debug] [rdbms] Wikimedia\Rdbms\LoadBalancer::reuseOrOpenConnectionForNewRef: reusing connection for 5/testwiki
> 

> $backend = MediaWiki\MediaWikiServices::getInstance()->getFileBackendGroup()->get( 'local-multiwrite' );

> $iterator = $backend->getFileList( [ 'dir' => "mwstore://local-multiwrite/local-thumb/1/1f/Wikitech-2021-blue-large-icon-ww_(4th_copy).svg" ] );

> foreach ( $iterator as $file ) { var_dump( $file ); }
[debug] [FileOperation] HTTP start: GET https://ms-fe.svc.eqiad.wmnet/auth/v1.0
[debug] [FileOperation] HTTP complete: GET https://ms-fe.svc.eqiad.wmnet/auth/v1.0 code=200 size=0 total=0.102922 connect=0.031767
[debug] [FileOperation] HTTP start: GET https://ms-fe.svc.eqiad.wmnet/v1/AUTH_mw/wikipedia-test-local-thumb
[debug] [FileOperation] HTTP complete: GET https://ms-fe.svc.eqiad.wmnet/v1/AUTH_mw/wikipedia-test-local-thumb code=204 size=0 total=0.062615 connect=0.000053

Note that it's hitting eqiad and not codfw.

Change 893796 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/mediawiki-config@master] filebackend: hotfix - make swift master follow the mediawiki master

https://gerrit.wikimedia.org/r/893796

adding @Krinkle and @Reedy based on this.

Adding @aaron based on discussion in IRC.

Change 893796 merged by jenkins-bot:

[operations/mediawiki-config@master] filebackend: hotfix - make swift master follow the mediawiki master

https://gerrit.wikimedia.org/r/893796

Mentioned in SAL (#wikimedia-operations) [2023-03-02T16:59:49Z] <oblivian@deploy2002> Started scap: Backport for [[gerrit:893796|filebackend: hotfix - make swift master follow the mediawiki master (T330942)]]

Mentioned in SAL (#wikimedia-operations) [2023-03-02T17:01:39Z] <oblivian@deploy2002> oblivian: Backport for [[gerrit:893796|filebackend: hotfix - make swift master follow the mediawiki master (T330942)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet

@IKhitron we think the patch above should've put up an hotfix to the issue you're seeing, meaning that if you make a new reupload it should correctly update the thumbnails you see.

There are deeper issues in play here but those go beyond the scope of what I can do as an SRE to fix the situation. Let me know if things are indeed fixed, else we'll get more people involved.

Mentioned in SAL (#wikimedia-operations) [2023-03-02T17:09:06Z] <oblivian@deploy2002> Finished scap: Backport for [[gerrit:893796|filebackend: hotfix - make swift master follow the mediawiki master (T330942)]] (duration: 09m 16s)

Tested the fix through regular access and when uploading a new image, for a second the thumb was not updated but it was on page reload (unlike before). I think this is normal behaviour (it took 1 second to process/my browser cache clearing), so the issue seems fixed on new reuploads. https://test.wikipedia.org/wiki/File:Testing_uploads.png

It definitely works, thanks a lot.

@IKhitron if there are outstanding outdated files, purging the page now works, replacing the bad thumbnail with a fresh one! I did this successfully here, for example: https://commons.wikimedia.org/wiki/File:BSicon_num(R)6r.svg

@IKhitron if there are outstanding outdated files, purging the page now works, replacing the bad thumbnail with a fresh one! I did this successfully here, for example: https://commons.wikimedia.org/wiki/File:BSicon_num(R)6r.svg

Indeed, I've already tried this earlier.

Joe lowered the priority of this task from Unbreak Now! to High.Mar 2 2023, 5:37 PM

For now reducing priority, but not resolving as there are big underlying issues. When I've had the time to open a task describing those, we can resolve this one.

It definitely works, thanks a lot.

Thank you for the report and your help, it is really appreciated! And sorry again for the trouble we caused you.

Joe claimed this task.

The problem is back, see the same local file in the description (again, thumbs are not created, files are shown in articles with standard sizes only). Should we reopen this task, or create a new one?

It might be T333042?

As of description there - no chance.

Anybody, please? It doesn't work for days, there is no way to upload new files, either on Commons or locally.

It might be T333042?

As of description there - no chance.

Read comments as well. It's the same. The description is misleading.

Anybody, please? It doesn't work for days, there is no way to upload new files, either on Commons or locally.

I can take a look at it later in the day but I spent a full day on this last time and this is not really my job (my responsibilities are about databases)

This comment was removed by IKhitron.

Read comments as well. It's the same. The description is misleading.

Still, they talk about Commons.

Commons is just a different container in the infra, the configuration is the same.

They say never happens on local wikis.

Folks on Commons are reporting that this is still a problem.

I did a bit of investigation, I'm sure this issue in itself is fixed and what people are seeing is T331138: FileBackendMultiWrite multi-dc and thumbnail handling. Looks the same problem but completely different underlying problem. I'll write in depth in the main ticket.