
Picture from Commons not found from Singapore
Open, MediumPublic

Description

Dear all,

The picture Vier Friedenssteine (Kösen).jpg is not accessible for me from Singapore. I can access it if I use a VPN with a different country setting, but not without. So it might be a problem only with the server responsible for Singapore.

The error message I get is (trying to load https://upload.wikimedia.org/wikipedia/commons/8/8c/Vier_Friedenssteine_%28K%C3%B6sen%29.jpg ):
File not found: /v1/AUTH_mw/wikipedia-commons-local-public.8c/8/8c/Vier_Friedenssteine_%28K%C3%B6sen%29.jpg

I think I recently had another picture with the same problem, so it might not be a unique issue, but I could not find another example right now.

The topic was discussed in German at https://de.wikipedia.org/wiki/Wikipedia:Administratoren/Anfragen#Hauptseite but will soon be archived. Nobody else had the problem, but to my knowledge nobody else tried a VPN via Singapore either.

Cheers, Fano

Event Timeline

Fano created this task. · Aug 23 2019, 2:20 PM
Restricted Application added a project: Operations. · Aug 23 2019, 2:20 PM
Restricted Application added a subscriber: Aklapper.
Der_Keks updated the task description. · Aug 23 2019, 2:22 PM
CDanis edited projects, added SRE-swift-storage; removed ops-eqsin. · Aug 23 2019, 2:29 PM
CDanis added a subscriber: CDanis.

Looks like we are indeed serving 404s for this object from codfw and all points west:

✔️ cdanis@cdanis ~ 🕥☕ for DC in esams eqiad codfw ulsfo eqsin ; do echo -ne "$DC\t" ; curl -s -D- --resolve upload.wikimedia.org:443:$(dig +short upload-lb.$DC.wikimedia.org)  https://upload.wikimedia.org/wikipedia/commons/8/8c/Vier_Friedenssteine_%28K%C3%B6sen%29.jpg -o/dev/null | head -n1 ; done 
esams	HTTP/2 200 
eqiad	HTTP/2 200 
codfw	HTTP/2 404 
ulsfo	HTTP/2 404 
eqsin	HTTP/2 404

codfw, ulsfo, eqsin are all returning a cached 404; maybe the object didn't get written to the codfw swift cluster in time?

Indeed, the object exists on eqiad, but never made it to codfw swift:

✔️ cdanis@ms-fe1005.eqiad.wmnet ~ 🕥☕ swift stat wikipedia-commons-local-public.8c '8/8c/Vier_Friedenssteine_(Kösen).jpg'
        Account: AUTH_mw
      Container: wikipedia-commons-local-public.8c
         Object: 8/8c/Vier_Friedenssteine_(Kösen).jpg
   Content Type: image/jpeg
 Content Length: 7874200
  Last Modified: Wed, 24 Jul 2019 14:52:03 GMT
           ETag: 13957dece15374a458b0ac27d324f1b2
Meta Sha1Base36: dva90jw330e3jydea0cd17og7hits7l
  Accept-Ranges: bytes
    X-Timestamp: 1563979922.33473
     X-Trans-Id: txda933052252646cb98182-005d5ffb21
❌cdanis@ms-fe2005.codfw.wmnet ~ 🕥☕ swift stat wikipedia-commons-local-public.8c '8/8c/Vier_Friedenssteine_(Kösen).jpg'  
Object HEAD failed: http://ms-fe.svc.codfw.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.8c/8/8c/Vier_Friedenssteine_%28K%C3%B6sen%29.jpg 404 Not Found
Failed Transaction ID: tx9fc99d906cf34372b23bb-005d5ffb13
aaron added a subscriber: aaron. · Aug 23 2019, 3:01 PM

Isn't there a swiftrepl background process to fix this?

Isn't there a swiftrepl background process to fix this?

@fgiunchedi tells me that we turned off swiftrepl once the work in T204245: Run MediaWiki media originals active/active made writes from MediaWiki synchronous to all Swift instances.

However, that doesn't seem to have been enough here?

Isn't there a swiftrepl background process to fix this?

@fgiunchedi tells me that we turned off swiftrepl once the work in T204245: Run MediaWiki media originals active/active made writes from MediaWiki synchronous to all Swift instances.
However, that doesn't seem to have been enough here?

I can confirm (per chat in the wikimedia-traffic IRC channel), and indeed that doesn't seem to have been sufficient. The next action is a dry run of swiftrepl to see how much reconciliation is needed.
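As background, the synchronous multi-backend write behavior discussed above can be sketched roughly as follows. This is a hypothetical Python sketch, not MediaWiki's actual FileBackendMultiWrite (which is considerably more involved); all class and store names here are invented for illustration:

```python
class FailingStore(dict):
    """Hypothetical stand-in for a Swift cluster that rejects writes."""
    def __setitem__(self, key, value):
        raise IOError('backend unavailable')


class MultiWriteBackend:
    """Sketch of synchronous writes to several object stores."""
    def __init__(self, backends):
        self.backends = backends  # name -> dict-like object store

    def put(self, path, data):
        results = {}
        for name, store in self.backends.items():
            try:
                store[path] = data
                results[name] = 'ok'
            except IOError as exc:
                results[name] = 'failed: %s' % exc
        # With 'replication' set to 'sync', a failure on any cluster
        # should fail the whole operation, so the DB never commits a
        # row for a file that one cluster doesn't have.
        if any(r != 'ok' for r in results.values()):
            raise IOError('backend-fail-internal: %r' % results)
```

The open question in this task is essentially why an upload could succeed overall while one of the synchronous writes (local-swift-codfw) failed.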

aaron added a comment. · Aug 23 2019, 3:43 PM

Still, the file was only uploaded, with no other operations done... I'm not sure why the DB would commit if the file store failed in one of the FileBackendMultiWrite backends and 'replication' is 'sync'...

Ammarpad renamed this task from "Picture from commons not found from Singapure" to "Picture from Commons not found from Singapore". · Aug 23 2019, 3:45 PM
Restricted Application added a project: Commons. · Aug 23 2019, 3:50 PM
Krinkle added a subscriber: Krinkle. · Edited · Aug 23 2019, 4:04 PM

Actual upload was logged at "14:52, 24 July 2019" (UTC).

From https://logstash.wikimedia.org/goto/468571c6470f082f76fa5d24b90a1bf0
Query: message:(swift OR "Friedenssteine") under type:mediawiki

  • 2019-07-24T14:51:10 XThwWwpAEDUAADs7m2YAAABK /w/api.php ErrorException from line 1759 of /srv/mediawiki/php-1.34.0-wmf.15/includes/libs/filebackend/FileBackendStore.php: PHP Notice: Unable to delete stat cache for file mwstore://local-swift-eqiad/local-thumb/7/76/#…#/#….#jpg.
  • 2019-07-24T14:51:53 [XThwhQpAEMQAAIt1hpIAAACL] /w/api.php UploadChunkFileException from line 366 of /srv/mediawiki/php-1.34.0-wmf.15/includes/upload/UploadFromChunks.php: Error storing file in '/tmp/Il3c3n': backend-fail-internal; local-swift-codfw
  • 2019-07-24T14:52:25 [XThwpwpAMDsAAAljTfsAAACN] /w/api.php UploadChunkFileException from line 366 of /srv/mediawiki/php-1.34.0-wmf.15/includes/upload/UploadFromChunks.php: Error storing file in '/tmp/uQhx58': backend-fail-internal; local-swift-codfw
  • 2019-07-24T14:52:38 [XThwswpAMDoAAI8PtlYAAABY] /w/api.php UploadChunkFileException from line 366 of /srv/mediawiki/php-1.34.0-wmf.15/includes/upload/UploadFromChunks.php: Error storing file in '/tmp/8X2t8D': backend-fail-internal; local-swift-codfw
  • 2019-07-24T14:52:41 [XThwuApAEDIAAIT5910AAAAS] /w/api.php UploadChunkFileException from line 366 of /srv/mediawiki/php-1.34.0-wmf.15/includes/upload/UploadFromChunks.php: Error storing file in '/tmp/94aKa9': backend-fail-internal; local-swift-codfw

The last ones all have more or less the same trace and metadata:

referer: https://commons.wikimedia.org/wiki/Special:UploadWizard
reqId: XThwhQpAEMQAAIt1hpIAAACL
POST commonswiki /w/api.php

[XThwhQpAEMQAAIt1hpIAAACL] /w/api.php   UploadChunkFileException from line 366 of /srv/mediawiki/php-1.34.0-wmf.15/includes/upload/UploadFromChunks.php: Error storing file in '/tmp/Il3c3n': backend-fail-internal; local-swift-codfw
#0 /srv/mediawiki/php-1.34.0-wmf.15/includes/upload/UploadFromChunks.php(254): UploadFromChunks->outputChunk(string)
#1 /srv/mediawiki/php-1.34.0-wmf.15/includes/api/ApiUpload.php(226): UploadFromChunks->addChunk(string, integer, integer)
#2 /srv/mediawiki/php-1.34.0-wmf.15/includes/api/ApiUpload.php(132): ApiUpload->getChunkResult(array)
#3 /srv/mediawiki/php-1.34.0-wmf.15/includes/api/ApiUpload.php(104): ApiUpload->getContextResult()
#4 /srv/mediawiki/php-1.34.0-wmf.15/includes/api/ApiMain.php(1583): ApiUpload->execute()
#5 /srv/mediawiki/php-1.34.0-wmf.15/includes/api/ApiMain.php(531): ApiMain->executeAction()
#6 /srv/mediawiki/php-1.34.0-wmf.15/includes/api/ApiMain.php(502): ApiMain->executeActionWithErrorHandling()
#7 /srv/mediawiki/php-1.34.0-wmf.15/api.php(86): ApiMain->execute()
#8 /srv/mediawiki/w/api.php(3): include(string)
#9 {main}

See also: T228292: API uploads fatal with UploadChunkFileException: Error storing file in '/tmp' backend-fail-internal

I did some digging in the swift logs around the time the file was uploaded; there's no record of swift in codfw ever receiving a PUT for this file.

There are plenty of other occurrences of similar-looking upload errors in the MediaWiki logs: https://logstash.wikimedia.org/goto/27bc8fe32d4441028fca86d41ae5f535

This is happening often enough that I think we should apply some other mitigations in addition to trying to track down what's failing in the active/active configuration.

I suggest:

  • when ATS gets a 404 response from one cluster, retry on the other active cluster (if this is feasible)
  • resuming usage of swiftrepl, ideally with collection of logs so we can see when files needed replication
  • add logging in MediaWiki for which file failed to upload -- AFAICT there's no good way to go from the Logstash logs for this error to anything meaningful in the Swift logs (the filename in the MW logs is a meaningless temporary local filename, e.g. /tmp/Il3c3n)
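The first suggestion above (retry a 404 against the other active cluster) might look something like this sketch. The fetcher callables and names are hypothetical; the real implementation would live in ATS configuration, not application code:

```python
def fetch_with_failover(path, clusters):
    """Try each cluster's fetcher in order; only propagate a 404 if
    every cluster returned one. 'clusters' is a list of callables
    taking a path and returning (status, body)."""
    status, body = 404, None
    for fetch in clusters:
        status, body = fetch(path)
        if status != 404:
            # Any non-404 answer (200, 5xx, ...) is returned as-is;
            # we only fail over on "not found".
            return status, body
    return status, body
```

One design consideration is whether 404s from the fallback cluster should still be cached at the edge, since a cached 404 is exactly what made this incident user-visible from eqsin.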

BTW for posterity, here's how I looked for logs:

On one of the central syslog servers (e.g. wezen):

ls /srv/syslog/archive/swift.log-2019072* | sudo nice xargs -L1 -P6 zgrep --line-buffered -F 8c/Vier_Friedenssteine | tee ~/T231086.txt
cat ~/T231086.txt | grep wikipedia-commons-local-public | awk '$4 ~ /^ms-fe/'

wikipedia-commons-local-public is the bucket used for originals (as opposed to thumbnails).

A stupid question: are there any techniques implemented to synchronize replicas, or to report failed replications? This won't be the only error. From what I read, you got all this information from logs as text files...

@Der_Keks yes, that is the purpose of the aforementioned swiftrepl daemon.

Ah okay, and I understand that it's currently disabled.

@Aklapper I haven't investigated in any depth, but T231078 looks like a different issue. In that case, the file is not available from any datacenter:

✔️ cdanis@cdanis ~ 🕞🍵 for DC in esams eqiad codfw ulsfo eqsin ; do echo -ne "$DC\t" ; curl -s -D- --resolve upload.wikimedia.org:443:$(dig +short upload-lb.$DC.wikimedia.org)  https://upload.wikimedia.org/wikipedia/commons/a/ab/Chirimena_%281341689114%29.jpg -o/dev/null | head -n1  ; done
esams	HTTP/2 404 
eqiad	HTTP/2 404 
codfw	HTTP/2 404 
ulsfo	HTTP/2 404 
eqsin	HTTP/2 404

Change 532096 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] [DNM] filerepo: make LocalFile*Batch class more robust

https://gerrit.wikimedia.org/r/532096

Joe triaged this task as Medium priority. · Aug 26 2019, 8:13 AM
ema moved this task from Triage to Watching on the Traffic board. · Aug 27 2019, 10:08 AM

Fortunately this occurrence seems to be quite rare.

On each Swift frontend host, I:

  • grepped today's logs for GETs that resulted in 404s against wikipedia-commons-local-public buckets (originals, not thumbnails).
  • took a random sample of 5000 of those (today there were about 50k-70k 404s per frontend, so total sample is a bit shy of 10%)
  • attempted to fetch the same files from the alternate datacenter.

I found just a handful of cases that 404'd in one place but returned 200 from the other.

In codfw, these files were not found, but returned 200 from eqiad:
200 https://upload.wikimedia.org/wikipedia/commons/7/7c/BMW_Z8_1Y7A6180.jpg
200 https://upload.wikimedia.org/wikipedia/commons/8/88/Wayne-Williams-Mugshot-e1276280330289.jpg
200 https://upload.wikimedia.org/wikipedia/commons/f/f4/LRC-Library_250x150.jpg
200 https://upload.wikimedia.org/wikipedia/commons/a/a3/Fornasetti_Trumeau_Architettura_Courtesy_Archivio_Fornasetti.jpg
200 https://upload.wikimedia.org/wikipedia/commons/a/af/Dudleyseers.jpg

In eqiad, this file was not found, but returned 200 from codfw:
200 https://upload.wikimedia.org/wikipedia/commons/f/f8/Paul_Puustusmaa.jpg

That last file now returns 200 OK from eqiad, interestingly (so maybe the 404 in the logs was just a race condition immediately after upload?)

Anyway, tomorrow I will attempt to get swiftrepl running again, which should fix all of these cases.

aaron added a comment. · Aug 29 2019, 5:09 AM

I suggest:

  • when ATS gets a 404 response from one cluster, retry on the other active cluster (if this is feasible)

Before any file operation batch, FileBackendMultiWrite::consistencyCheck() will trigger resyncFiles() if something doesn't match. If that also fails, it will bail before doing anything. Perhaps it could double-check consistencyCheck(). Also, I should make it bail if anything gets null from getFileStat(), since that is a non-answer.
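The check / resync / double-check flow described here can be sketched as follows. This is a hypothetical illustration (function names and the dict-based stores are mine, not MediaWiki's API), including the point that a null stat is a non-answer and should cause a bail rather than be treated as "missing":

```python
def consistency_check(stats, path):
    """Compare each backend's stat for path. A None stat is a
    non-answer: bail rather than treating it as 'missing'."""
    answers = [stat(path) for stat in stats]
    if any(a is None for a in answers):
        return 'bail'
    if all(a == answers[0] for a in answers):
        return 'consistent'
    return 'mismatch'


def ensure_consistent(stats, path, resync):
    """Check, resync on mismatch, then double-check before
    proceeding with the operation batch."""
    state = consistency_check(stats, path)
    if state == 'mismatch':
        resync(path)
        state = consistency_check(stats, path)
    return state == 'consistent'
```

Only a 'consistent' result after the (re-)check lets the batch proceed; both 'bail' and a persistent 'mismatch' abort it.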

Change 533143 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] filebackend: make FileBackend consistencyCheck() and resyncFiles() more robust

https://gerrit.wikimedia.org/r/533143

Change 533143 merged by jenkins-bot:
[mediawiki/core@master] filebackend: improve FileBackendMultiWrite consistencyCheck()/resyncFiles()

https://gerrit.wikimedia.org/r/533143

Restricted Application added a project: Structured-Data-Backlog. · Sep 5 2019, 8:29 PM
Krenair added a subscriber: Krenair. · Sep 5 2019, 8:49 PM
CDanis added a comment. · Edited · Sep 9 2019, 10:43 PM

On each Swift frontend host, I:

  • grepped today's logs for GETs that resulted in 404s against wikipedia-commons-local-public buckets (originals, not thumbnails).
  • took a random sample of 5000 of those (today there were about 50k-70k 404s per frontend, so total sample is a bit shy of 10%)
  • attempted to fetch the same files from the alternate datacenter.

Here's a sample of doing this procedure in codfw. If you want to run it in eqiad, just s/eqiad/codfw/ in curl's --resolve argument.

✔️ cdanis@ms-fe2005.codfw.wmnet ~ 🕡🍺 cat /var/log/swift/proxy-access.log| awk '/wikipedia-commons-local-public/ && / GET / && /HTTP.... 404/ { print $10 }' | tee ~/404-urls.txt.$(date -I) | wc -l
86737
✔️ cdanis@ms-fe2005.codfw.wmnet ~ 🕡🍺 shuf -n2000 ~/404-urls.txt.$(date -I) | sed 's!/v1/AUTH_mw/wikipedia-commons-local-public...!https://upload.wikimedia.org/wikipedia/commons!' | xargs curl -s -X HEAD --resolve upload.wikimedia.org:443:$(dig +short upload-lb.eqiad.wikimedia.org) -w '%{http_code}\t%{url_effective}\n' | grep -v '^404'
200	https://upload.wikimedia.org/wikipedia/commons/d/dc/SkyLab_Isometric_2.tiff
fgiunchedi moved this task from Backlog to Doing on the User-fgiunchedi board. · Sep 20 2019, 8:26 AM

swiftrepl is now puppetized to run an eqiad -> codfw sync once a week, on Mondays (without deletes). I'd like to also turn on codfw -> eqiad synchronization later in the week, assuming that's (always?) safe to do. @aaron, what do you think?

aaron added a comment. · Edited · Oct 30 2019, 3:33 PM

swiftrepl is puppetized now to run an eqiad -> codfw sync once a week on Monday (without deletes).

Are deletes enabled for this sync direction? Is this just for originals?

I'd like to turn on also codfw -> eqiad synchronization later in the week, assuming that's (always?) safe to do. @aaron what do you think ?

So it's highest-timestamp-wins? Is this just for originals? What stops deleted files (moved from -public to -deleted) from coming back to life with this?

As long as we store files under full names, I keep thinking about the idea of using tombstones (at the MediaWiki level, for originals, but not thumbnails/temp stuff). It wouldn't be hard to code (AFAIK), and it would easily distinguish missing vs. deleted for cases like this.
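The tombstone idea can be sketched in a few lines. This is an illustrative Python sketch under my own assumptions (the marker value, the dict-like store, and the function are hypothetical, not anything MediaWiki or Swift currently implements):

```python
TOMBSTONE = b'__tombstone__'  # hypothetical marker written on delete


def classify(store, path):
    """With tombstones, a reconciler can tell 'never replicated'
    (safe to copy from the other side) apart from 'deliberately
    deleted' (must not be resurrected)."""
    obj = store.get(path)
    if obj is None:
        return 'missing'   # candidate for re-replication
    if obj == TOMBSTONE:
        return 'deleted'   # keep deleted; do not copy back
    return 'present'
```

Without a tombstone, 'missing' and 'deleted' look identical to a replicator, which is exactly the resurrection problem discussed below for bidirectional syncing.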

Also, in the future, it might be nice to have a MediaWiki operation timestamp added to files for comparison (distinct from the more easily touched/fragile Last-Modified). Alternatively, swiftrepl could set X-Timestamp itself (to the timestamp of the source file, when there is one, or of the tombstone, if we had those).

swiftrepl is puppetized now to run an eqiad -> codfw sync once a week on Monday (without deletes).

Are deletes enabled for this sync direction? Is this just for originals?

No, we haven't enabled deletes in this direction. Re: containers, all non-thumbnail containers are synced, namely all of the following (including the "nicknames" we use to select a container-name regexp):

commons)
    regexp='^wikipedia-commons-local-(public|deleted).[0-9a-z]{2}$'
    ;;
notcommons)
    regexp='^wikipedia-..-local-(public|deleted).[0-9a-z]{2}$'
    ;;
unsharded)
    regexp='^wik[a-z]+-.*-local-(public|deleted)$'
    ;;
global)
    regexp='^global-.*$'
    ;;
timeline)
    regexp='^wik[a-z]+-.*-timeline-render$'
    ;;
transcoded)
    regexp='^wik[a-z]+-.*-transcoded(.[0-9a-f]{2})?$'
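Those patterns can be sanity-checked with a few lines of Python. The pattern strings are copied verbatim from the case statement above; the helper function and the example container names are mine:

```python
import re

# Nickname -> container-name pattern, as quoted above.
PATTERNS = {
    'commons': r'^wikipedia-commons-local-(public|deleted).[0-9a-z]{2}$',
    'notcommons': r'^wikipedia-..-local-(public|deleted).[0-9a-z]{2}$',
    'unsharded': r'^wik[a-z]+-.*-local-(public|deleted)$',
    'global': r'^global-.*$',
    'timeline': r'^wik[a-z]+-.*-timeline-render$',
    'transcoded': r'^wik[a-z]+-.*-transcoded(.[0-9a-f]{2})?$',
}


def classify_container(name):
    """Return the first nickname whose pattern matches, or None
    (e.g. for thumbnail containers, which are not synced)."""
    for nick, pat in PATTERNS.items():
        if re.match(pat, name):
            return nick
    return None
```

Note that thumbnail containers (e.g. wikipedia-commons-local-thumb.8c) fall through every pattern, matching the statement that all non-thumbnail containers are synced.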

I'd like to turn on also codfw -> eqiad synchronization later in the week, assuming that's (always?) safe to do. @aaron what do you think ?

So it's highest-timestamp-wins? Is this just for originals? What stops deleted files (moved from -public to -deleted) from coming back to life with this?

If an object isn't present on the destination, or it is present but there's an ETag mismatch, then the object will be copied/synced from source to destination. Indeed, if a move from -public to -deleted reached eqiad but not codfw, then syncing codfw -> eqiad would bring the object back :| The list of containers would be the same as above (i.e. all but thumbs).
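That copy-if-missing-or-ETag-mismatch rule can be sketched as follows. This is an illustrative sketch only, not the actual swiftrepl implementation; stores are modeled as dicts mapping paths to bytes, and the ETag computation assumes simple (non-segmented) objects, for which Swift's ETag is the MD5 of the content:

```python
import hashlib


def etag(data):
    # Swift's ETag for simple objects is the MD5 hex digest of the body.
    return hashlib.md5(data).hexdigest()


def swiftrepl_pass(source, dest):
    """One-way sync: copy when the object is missing on the
    destination or the ETags differ. Deletes are never propagated."""
    copied = []
    for path, data in source.items():
        if path not in dest or etag(dest[path]) != etag(data):
            dest[path] = data
            copied.append(path)
    return copied
```

The resurrection hazard is visible here: an object deleted on the destination side of a later reverse pass, but still present on its source, gets copied right back, which is the concern raised in this comment.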

As long as we store files under full names, I keep thinking about the idea of using tombstones (on the MediaWiki level, for originals, but not thumbnails/temp stuff). It wouldn't be hard to code (AFAIK), though it would easily distinguish missing vs deleted for cases like this.

Agreed! I like the idea of using tombstones for these cases.

Also, in the future, it might be nice to have a MediaWiki operation timestamp added to files for comparison (distinct from the more easily touched/fragile Last-Modified). Alternatively, swift-repl could set X-Timestamp itself (to the time of the source file, when there is one, or tombstone, if we had those).

That seems doable to me. At first I was worried about all the additional metadata we'd be storing, although in this scheme thumbnails (which represent the highest object count now) wouldn't be affected (I think?).

At any rate, it seems we can't simply enable swiftrepl codfw -> eqiad as-is, at least for now (?)

Gilles moved this task from Inbox to Radar on the Performance-Team board. · Nov 4 2019, 8:49 PM
Gilles edited projects, added Performance-Team (Radar); removed Performance-Team.