Page MenuHomePhabricator

Some recent Commons uploads not available on other wikis (2020-11)
Closed, ResolvedPublic

Description

If you go to https://cs.wikipedia.org/w/index.php?title=Profimann&oldid=19149303, you will see this:

The file called Profiman.jpg is not loaded from Commons, despite it exists at https://commons.wikimedia.org/wiki/File:Profiman.jpg, and globalusage even shows it is used.

This can be reproduced at testwiki, see https://test.wikipedia.org/w/index.php?title=User:Martin_Urbanec/sandbox&oldid=455753.

I'm not sure whether it is a permanent or temporary bug.

Saving a null edit or previewing the page does not help. The file was uploaded at 16:12, 10 November 2020, which should be enough time for the image to propagate.

I cannot reproduce this with a different image so far.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Restricted Application added a project: Commons. · View Herald TranscriptTue, Nov 10, 5:28 PM
Patriccck renamed this task from file from commons is not loaded properly at cswiki to File from commons is not loaded properly.Tue, Nov 10, 6:29 PM

Another example here.

Sounds like T253405: Some (recent?) uploads to Commons are not available on other wikis again. Looks like @jijiki's been doing some memcached work today relating to T244340, so that seems like a likely culprit. The files eventually appearing after some time is also consistent with a caching problem.

jijiki triaged this task as Medium priority.Tue, Nov 10, 9:02 PM

@AntiCompositeNumber the feature we have been working on has been enabled in mw1276 (api) and mw1263 (app), where items are fetched from the main memcached cluster, and cached locally for maximum 10s. At a first glance it does not appear to be related, but let's see how it unfolds.

I don't know if this is helpful, but when I tried editing [[Allan Shivers]] on en-wiki, which is showing the problem, I noticed that the image appears in "Show Preview", but on saving it goes back to showing a redlink with URL https://en.wikipedia.org/wiki/Special:Upload?wpDestFile=Allan_Shivers_(1956).png

Urbanecm raised the priority of this task from Medium to High.Wed, Nov 11, 11:32 PM

@AntiCompositeNumber the feature we have been working on has been enabled in mw1276 (api) and mw1263 (app), where items are fetched from the main memcached cluster, and cached locally for maximum 10s. At a first glance it does not appear to be related, but let's see how it unfolds.

I'm boldly raising the priority to High, as this breaks an important feature that's used by a lot of people, with no available workaround.

I don't know if this is helpful, but when I tried editing [[Allan Shivers]] on en-wiki, which is showing the problem, I noticed that the image appears in "Show Preview", but on saving it goes back to showing a redlink with URL https://en.wikipedia.org/wiki/Special:Upload?wpDestFile=Allan_Shivers_(1956).png

Hrm. Sounds like you hit a server that was in sync for the preview, but out of sync for the page load. Can you provide the response headers for the request that did and didn't work? The server and x-cache headers are most important. I'm not able to reproduce the behavior, hitting mw1325, mw1267, and mw1370 for page loads and mw1331, mw1327, mw1263, mw1321, mw1352, and mw1411 for previews.

Page load
HTTP/2 200 OK
date: Wed, 11 Nov 2020 23:30:01 GMT
server: mw1325.eqiad.wmnet
x-content-type-options: nosniff
content-language: en
vary: Accept-Encoding,Cookie,Authorization
expires: Thu, 01 Jan 1970 00:00:00 GMT
x-request-id: 59e5397f-9ea6-4dd3-8e59-299789b988f4
last-modified: Wed, 11 Nov 2020 23:17:52 GMT
content-type: text/html; charset=UTF-8
content-encoding: gzip
age: 0
x-cache: cp1089 miss, cp1079 pass
x-cache-status: pass
server-timing: cache;desc="pass"
strict-transport-security: max-age=106384710; includeSubDomains; preload
report-to: { "group": "wm_nel", "max_age": 86400, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }
nel: { "report_to": "wm_nel", "max_age": 86400, "failure_fraction": 0.05, "success_fraction": 0.0}
x-client-ip: ****
cache-control: private, s-maxage=0, max-age=0, must-revalidate
accept-ranges: bytes
X-Firefox-Spdy: h2
Preview
HTTP/2 200 OK
date: Wed, 11 Nov 2020 23:30:14 GMT
server: mw1331.eqiad.wmnet
x-content-type-options: nosniff
content-language: en
x-frame-options: DENY
vary: Accept-Encoding,Cookie,Authorization
expires: Thu, 01 Jan 1970 00:00:00 GMT
pragma: no-cache
x-request-id: ba71f451-ce4f-461c-9925-392dfe79a3e7
content-type: text/html; charset=UTF-8
content-encoding: gzip
age: 4
x-cache: cp1083 miss, cp1079 pass
x-cache-status: pass
server-timing: cache;desc="pass"
strict-transport-security: max-age=106384710; includeSubDomains; preload
report-to: { "group": "wm_nel", "max_age": 86400, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }
nel: { "report_to": "wm_nel", "max_age": 86400, "failure_fraction": 0.05, "success_fraction": 0.0}
x-client-ip: ****
cache-control: private, s-maxage=0, max-age=0, must-revalidate
accept-ranges: bytes
X-Firefox-Spdy: h2
AntiCompositeNumber renamed this task from File from commons is not loaded properly to Some recent Commons uploads not available on other wikis (2020-11).Wed, Nov 11, 11:54 PM
Krinkle added a subscriber: Krinkle.EditedThu, Nov 12, 1:24 AM

@AntiCompositeNumber the feature we have been working on has been enabled in mw1276 (api) and mw1263 (app), where items are fetched from the main memcached cluster, and cached locally for maximum 10s. At a first glance it does not appear to be related, but let's see how it unfolds.

I think this is a likely suspect. This issue was reported within 3 hours of onhost-memcached pooled (per T244340#6616207).

Afaik nothing else has has changed in the Memcached configuration recently, and unlike T253405 there are not currently multiple MW versions or different MW configs for Memc between wikis. In T253405 this problem resulted from a split-brain scenario due to the inability to set and purge the same Memcached key from different servers and wikis.

Can we turn off onhost-memcached until Monday to rule it out? At worst it's no different. At best, it resolves the issue and we can investigate with more time next week (given WMF holiday).

So, what we're seeing is that some files remain unavailable for many hours even upto 1.5 day and no amount of purging seems to help. Whether purging of host articles, local file pages, or Commons file pages.

I can also reproduce the issue consistently from mwdebug1001, which mostly rules out the onhost-memcached issue. It is still possible that a read was persisted elsewhere that mwdebug1001 would see, but that seems very unlikely would would have likely caused issues more generally if we were to do that sort of thing. I've also reproduced it from a Codfw server which has a cold memcached cluster separately from Eqiad and the issue happens there as well.

I then remember that we did actually make some changes around FileBackend caching, but it was more than a week ago - which, given that we went from wmf.14 to wmf.16 (skipping wmf.15) means it actually went out this Monday.

On mwdebug1001, I confirmed once more that this Commons file was not being rendered on this Wikidata item even after a purge, nor from Codfw mwdebug2001. Then I switched commonswiki and wikidatawiki to wmf.14 through wikiversions.php and purged the article, and it started to show up correctly. Success!

Comparing wmf,16 commits to wmf.14 commits, I suspect it this caused by ths commit somehow:

Change 640503 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] Revert "filerepo: clean up shared cache keys to avoid key metrics clutter"

https://gerrit.wikimedia.org/r/640503

Change 640504 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@wmf/1.36.0-wmf.16] Revert "filerepo: clean up shared cache keys to avoid key metrics clutter"

https://gerrit.wikimedia.org/r/640504

<Krinkle> likely culprit: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/604161
<Krinkle> I see a mis-matching cache key pattern there with regards to the descripton page.
<Krinkle> however those keys already didn't match before this commit....
<Krinkle> plus, afaik descripton pages are not relevant when parser is trying to link a file.
<Krinkle> OK, so I think I may've gotten to the heart of the issue.
<Krinkle> I think the issue is that LocalRepo no longer has a consistent getSharedCacheKey method. It previously returned a key always.
<Krinkle> whereas now only if hasSharedCache is true.
<Krinkle> Which, while true for the cross-wiki ForeignDBViaLBRepo object, won't be true for a LocalRepo object.
<Krinkle> hm.. nevermind. hasAccessibleSharedCache = true is set explicitly in the base class to keep this working. Only foreign ones take the hasSharedCache option.
<Krinkle> Aha! The commit added $this:FileRepo->getName() to the cache key. That didn't exist previously.
<Krinkle> That's presumably not gonna match, since it'll be local on one side, and something different when the foreign class deals with that wiki.

This comment was removed by Krinkle.
Krinkle added a comment.EditedThu, Nov 12, 3:53 AM

I'm unable to find anyone online to approve or be deploy buddy.

To the next person here: Hey you, deploy the below patch at your earliest convenience :)

This comment was removed by Krinkle.

Change 640746 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@wmf/1.36.0-wmf.16] Revert "filerepo: clean up shared cache keys to avoid key metrics clutter"

https://gerrit.wikimedia.org/r/640746

Not many people are around, and most importantly no one with extensive WanCache experience. @ArielGlenn and I are thinking of merging the patch and testing it on a debug server to see if it fixes the issue for files where we see it.

hashar added a subscriber: hashar.Thu, Nov 12, 8:33 AM

Change 640746 merged by jenkins-bot:
[mediawiki/core@wmf/1.36.0-wmf.16] Revert "filerepo: clean up shared cache keys to avoid key metrics clutter"

https://gerrit.wikimedia.org/r/640746

Mentioned in SAL (#wikimedia-operations) [2020-11-12T09:12:25Z] <hashar> Pulled https://gerrit.wikimedia.org/r/640746 on deployment server for # T267668

Mentioned in SAL (#wikimedia-operations) [2020-11-12T09:19:09Z] <hashar@deploy1001> Synchronized php-1.36.0-wmf.16/includes/filerepo: Revert "filerepo: clean up shared cache keys to avoid key metrics clutter" - T267668 (duration: 01m 01s)

Joe added a comment.Thu, Nov 12, 9:21 AM

The current situation is:

  • this should not happen for new images. The revert seems to have fixed that issue.
  • for images already uploaded you will need to forcibly purge the cache.

And there is now some method signature mismatches:

PHP Warning: Declaration of ForeignDBViaLBRepo::getSharedCacheKey(...$args) should be compatible with LocalRepo::getSharedCacheKey($kClassSuffix, ...$components)

#0 /srv/mediawiki/php-1.36.0-wmf.16/includes/AutoLoader.php(109): MWExceptionHandler::handleError(integer, string, string, integer, array)
#1 /srv/mediawiki/php-1.36.0-wmf.16/includes/AutoLoader.php(109): require()
#2 [internal function]: AutoLoader::autoload(string)
#3 /srv/mediawiki/php-1.36.0-wmf.16/includes/filerepo/RepoGroup.php(422): spl_autoload_call(string)
#4 /srv/mediawiki/php-1.36.0-wmf.16/includes/filerepo/RepoGroup.php(408): RepoGroup->newRepo(array)
#5 /srv/mediawiki/php-1.36.0-wmf.16/includes/filerepo/RepoGroup.php(374): RepoGroup->initialiseRepos()
#6 /srv/mediawiki/php-1.36.0-wmf.16/includes/OutputPage.php(1924): RepoGroup->forEachForeignRepo(Closure)
#7 /srv/mediawiki/php-1.36.0-wmf.16/includes/OutputPage.php(2018): OutputPage->addParserOutputMetadata(ParserOutput)
#8 /srv/mediawiki/php-1.36.0-wmf.16/includes/page/Article.php(742): OutputPage->addParserOutput(ParserOutput, array)
#9 /srv/mediawiki/php-1.36.0-wmf.16/includes/actions/ViewAction.php(74): Article->view()
#10 /srv/mediawiki/php-1.36.0-wmf.16/includes/MediaWiki.php(530): ViewAction->show()
#11 /srv/mediawiki/php-1.36.0-wmf.16/includes/MediaWiki.php(316): MediaWiki->performAction(Article, Title)
#12 /srv/mediawiki/php-1.36.0-wmf.16/includes/MediaWiki.php(943): MediaWiki->performRequest()
#13 /srv/mediawiki/php-1.36.0-wmf.16/includes/MediaWiki.php(546): MediaWiki->main()
#14 /srv/mediawiki/php-1.36.0-wmf.16/index.php(53): MediaWiki->run()
#15 /srv/mediawiki/php-1.36.0-wmf.16/index.php(46): wfIndexMain()
#16 /srv/mediawiki/w/index.php(3): require(string)
#17 {main}

The getSharedCacheKey() methods definitely have ...$args as argument in all classes after the revert. The PHP warning must be related to the deployment of code not being atomic at all (either due to rsync or and PHP opcache), so there is a short window of time during which the code is inconsistent and still used to serve requests :-\

At a quick glance that remained inconsistent for a few seconds. So I am leaning toward opcache refresh.

hashar closed this task as Resolved.Thu, Nov 12, 4:24 PM
hashar assigned this task to Krinkle.

Confirming that 500/500 of the most recent uploads are working correctly. Additionally, the files identified previously are also all working correctly now.

Thanks @Krinkle and everyone else who was involved in this!

Change 640868 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] filerepo: remove repo name from getSharedCacheKey()

https://gerrit.wikimedia.org/r/640868

Change 640503 abandoned by Krinkle:
[mediawiki/core@master] Revert "filerepo: clean up shared cache keys to avoid key metrics clutter"

Reason:

https://gerrit.wikimedia.org/r/640503

Change 640868 merged by jenkins-bot:
[mediawiki/core@master] filerepo: remove repo name from getSharedCacheKey()

https://gerrit.wikimedia.org/r/640868

Change 641290 had a related patch set uploaded (by Krinkle; owner: Aaron Schulz):
[mediawiki/core@wmf/1.36.0-wmf.18] filerepo: remove repo name from getSharedCacheKey()

https://gerrit.wikimedia.org/r/641290

Change 641290 merged by jenkins-bot:
[mediawiki/core@wmf/1.36.0-wmf.18] filerepo: remove repo name from getSharedCacheKey()

https://gerrit.wikimedia.org/r/641290

Urbanecm reopened this task as Open.Thu, Nov 19, 11:26 AM

I do not think that this got fixed completely, see https://cs.wikipedia.org/w/index.php?title=Wikipedista:Martin_Urbanec/P%C3%ADskovi%C5%A1t%C4%9B/3&oldid=19177452. It's a recently moved file, but it sounds very similar to this one. Reopening.

AntiCompositeNumber added a comment.EditedThu, Nov 19, 3:38 PM

18 out of the last 500 files uploaded to Commons (3.6%) are affected.

/wiki/File:Andi%C5%86i.jpg: HTTP/2 404 
/wiki/File:METACLISI.jpg: HTTP/2 404 
/wiki/File:Irina_Nikitina.jpg: HTTP/2 404 
/wiki/File:-i---i-_(48404265022).jpg: HTTP/2 200 
/wiki/File:-i---i-_(48404131936).jpg: HTTP/2 200 
/wiki/File:-i---i-_(48404289902).jpg: HTTP/2 200 
/wiki/File:%D0%9F%D0%B0%D1%80%D0%BA_%D0%BF%D0%B0%D1%80%D1%82%D0%B8%D0%B7%D0%B0%D0%BD%D1%81%D0%BA%D0%BE%D0%B9_%D1%81%D0%BB%D0%B0%D0%B2%D1%8B,_%D1%84%D0%BE%D0%BD%D1%82%D0%B0%D0%BD.jpg: HTTP/2 404 
/wiki/File:Mohammed_Vakil.jpg: HTTP/2 404 
/wiki/File:Antheren_von_Acanthus_hungaricus.png: HTTP/2 404 
/wiki/File:Acanthus_hungaricus_mit_Holzbiene.png: HTTP/2 404 
/wiki/File:%D0%95%D0%BA%D0%B2%D0%B0%D1%82%D0%BE%D1%802.jpg: HTTP/2 404 
/wiki/File:Bio_parque_.jpg: HTTP/2 404 
/wiki/File:Hamun(Hamoun)_wetlands_Iran-Afghanistan_Border_%D9%87%D8%A7%D9%85%D9%88%D9%86.jpg: HTTP/2 404 
/wiki/File:KUD_Marice_Keren%C4%8Di%C4%8D_Pesnica_logo.jpg: HTTP/2 404 
/wiki/File:%D0%95%D0%BA%D0%B2%D0%B0%D1%82%D0%BE%D1%801.jpg: HTTP/2 404 
/wiki/File:%E6%9C%88%E9%96%93%E3%83%A9%E3%83%B3%E3%82%AF%E8%A1%A8_%E8%A1%A8%E9%9D%A2.jpg: HTTP/2 404 
/wiki/File:Gabajeva_Greda_jezero2.jpg: HTTP/2 404 
/wiki/File:A07A1550small.jpg: HTTP/2 404
dancy added a subscriber: dancy.Thu, Nov 19, 4:46 PM

@dancy: Should this be UBN as a blocker?

dancy raised the priority of this task from High to Unbreak Now!.Thu, Nov 19, 5:42 PM

@dancy: Should this be UBN as a blocker?

Yes it should. Changing the priority now.

Several affected files even do not show on Commons (example: https://commons.wikimedia.org/wiki/File:Antheren_von_Acanthus_hungaricus.png

For these files it is sufficient to use Commons-function "Purge this page's cache" and the images will a) display on Commons and b) are visible in WP-articles.
Example for past cache-purge: https://commons.wikimedia.org/wiki/File:Acanthus_hungaricus_mit_Holzbiene.png

dancy reassigned this task from Krinkle to aaron.Thu, Nov 19, 6:51 PM

Several affected files even do not show on Commons (example: https://commons.wikimedia.org/wiki/File:Antheren_von_Acanthus_hungaricus.png

For these files it is sufficient to use Commons-function "Purge this page's cache" and the images will a) display on Commons and b) are visible in WP-articles.

Commonswiki was rollbacked to 1.36.0-wmf.16: T263184#6634169. That's probably the reason why the purges work now.
Could the problem be an inconsistent state of changes related to this task between 1.36.0-wmf.16 and 1.36.0-wmf.18?

1.36.0-wmf.16 has a change https://gerrit.wikimedia.org/r/640746 (Revert "filerepo: clean up shared cache keys to avoid key metrics clutte). This was (as far as I can see) NOT merged into master and NOT reverted. Master and 1.36.0-wmf.18 have the (supposed) fix instead: https://gerrit.wikimedia.org/r/640868, https://gerrit.wikimedia.org/r/641290. This was NOT backported into 1.36.0-wmf.16. In a situation, when commonswiki was already on 1.36.0-wmf.18, but other wikis still on 1.36.0-wmf.16, I can imagine this scenario could be responsible for the reported problems.

AntiCompositeNumber added a comment.EditedThu, Nov 19, 7:47 PM

And now we have the inverse situation too: https://commons.wikimedia.org/wiki/File:DJEFF_Official.jpg was deleted, but the file metadata still appears. It's like the file description page was deleted but not the image. https://quarry.wmflabs.org/query/49941 shows that the data in the database is correct, so the issue must be in the cache somewhere.

Currently, (after the rollback), 500/500 of the most recent uploads to Commons are appearing correctly

Change 642135 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@wmf/1.36.0-wmf.16] filerepo: remove repo name from getSharedCacheKey()

https://gerrit.wikimedia.org/r/642135

Change 642135 merged by jenkins-bot:
[mediawiki/core@wmf/1.36.0-wmf.16] filerepo: remove repo name from getSharedCacheKey()

https://gerrit.wikimedia.org/r/642135

Mentioned in SAL (#wikimedia-operations) [2020-11-19T22:06:49Z] <krinkle@deploy1001> Synchronized php-1.36.0-wmf.16/includes/filerepo/: T267668 - I1115135ee, and Ic239bb9807 (duration: 01m 07s)

Krinkle closed this task as Resolved.Fri, Nov 20, 12:48 AM
Urbanecm reopened this task as Open.Fri, Nov 20, 12:06 PM

Sorry @Krinkle, but reopening again. The test case mentioned in my reopening comment: https://cs.wikipedia.org/w/index.php?title=Wikipedista:Martin_Urbanec/P%C3%ADskovi%C5%A1t%C4%9B/3&oldid=19177452, is still broken. Both images exist at Commons, and none displays. Purging did not help.

It's displaying correctly for me...

It was just like that for me (Both were shown with red link), but after purging, both pictures were shown correctly; Just like the picture above. Here: https://ckb.wikipedia.org/w/index.php?title=بەکارھێنەر:Aram/تاقیکردنەوە&oldid=730935

Urbanecm closed this task as Resolved.Fri, Nov 20, 12:56 PM

It's displaying correctly for me...

Hmm, works fine on mobile. Closing again, probably an issue on my end.

Mmballen3 set Due Date to Mon, Nov 23, 12:00 AM.Mon, Nov 23, 5:36 AM

1.36.0-wmf.18 has been rolled to commons.wikimedia.org Nov 23 at 20:09 UTC.