Descriptionless files (Missing page_latest referential integrity issue)
Closed, ResolvedPublic

Description

A number of files appeared on commons on 15th with no description (may or may not have page entry)
http://commons.wikimedia.org/w/index.php?title=Commons:Administrators%27_noticeboard&oldid=62820408#System_problems

select img_name, img_timestamp, rev_timestamp, page_latest, rev_timestamp - img_timestamp from image left join page on (page_title=img_name and page_namespace=6) left join revision on (reV_page = page_id and rev_parent_id = 0) where img_timestamp >= '20111114233531' limit 7;

+------------------------------------------------+----------------+----------------+-------------+-------------------------------+

img_nameimg_timestamprev_timestamppage_latestrev_timestamp - img_timestamp

+------------------------------------------------+----------------+----------------+-------------+-------------------------------+

La_playAaAa.png2011111423364220111120142847627659185909205(description created later)
Bruxelles_Java_Masque_Wayang_02_10_2011_06.jpg20111114233723NULL0NULL
Raonic_and_Youzhny.jpg20111114233732NULLNULLNULL
Darcy's_gravesite_in_SE_Dijon_(France).jpg2011111423391120111120142313627656285908402(description created later)
Defeater-lost-ground-album-cover.jpg20111114233944NULL0NULL
Jaguar_Sentado.jpg20111114234008NULLNULLNULL
Phil_Bryant.jpg20111114234008NULLNULLNULL

+------------------------------------------------+----------------+----------------+-------------+-------------------------------+

The timing matches with the update of external storage servers https://blog.wikimedia.org/2011/11/18/nobody-notices-when-its-not-broken-new-database-servers-deployed/

I think this is due to the image upload nested transactions, which break the outer transactions. See http://thread.gmane.org/gmane.science.linguistics.wikipedia.technical/49500

When the servers were switched, the second half transaction for those files would have failed and half-rollbacked.

We already suffered this kind of failures in the past, see bug 15430, bug 20744, bug 24978. Seems that each ~12 months we suffer one of those. It should finally be fixed.


Version: 1.23.0
Severity: critical
URL: http://commons.wikimedia.org/w/index.php?title=Commons:Administrators%27_noticeboard&oldid=62820408#System_problems
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=61898
https://bugzilla.wikimedia.org/show_bug.cgi?id=15430

bzimport set Reference to bz32551.
Platonides created this task.Via LegacyNov 21 2011, 6:40 PM
ArielGlenn added a comment.Via ConduitNov 21 2011, 7:05 PM

http://wikitech.wikimedia.org/view/Server_admin_log#November_14 and you can see the times match exactly.

I guess we could disable uploads during such maintenance.

brion added a comment.Via ConduitNov 21 2011, 8:43 PM

Was that the External Storage switchover with the read-only period?

Best is to make sure that doesn't happen by preparing the maintenance so there's still a read/write cluster available to accept saves.

In the case that we get stuck with it anyway, sounds like there's some bad error handling somewhere -- ideally the upload should end up getting rejected and rolled back if part of the file: description page edit fails. Oooh yeah nested transactions, not sure that works.

Platonides added a comment.Via ConduitNov 21 2011, 8:54 PM

Look at the wikitech thread. A second begin() does an implicit commit (with some backends, including mysql) so it's quite untransactional. We start a transaction, the file classes use the db for lock, then doEditPage() has a begin() commit() too...

Kaganer added a comment.Via ConduitDec 26 2011, 6:08 PM

Any pgogress for resolve this bug?

MarkAHershberger added a comment.Via ConduitDec 27 2011, 3:33 PM

(In reply to comment #3)

Look at the wikitech thread. A second begin() does an implicit commit (with
some backends, including mysql) so it's quite untransactional. We start a
transaction, the file classes use the db for lock, then doEditPage() has a
begin() commit() too...

See bug #27283 for this one

Kaganer added a comment.Via ConduitFeb 28 2012, 8:45 AM

Any pgogress for resolve this bug? Please fid & fix all from descriptionless files (with publishing list of these files) by creating temporary file info pages - for further manual filling.

Kaganer added a comment.Via ConduitMar 3 2012, 9:47 PM

See also other side of these same problem: file with first version is non exist. https://bugzilla.wikimedia.org/show_bug.cgi?id=34934

Kaganer added a comment.Via ConduitJun 28 2012, 12:31 PM

Any pgogress for resolve this bug?

bzimport added a comment.Via ConduitAug 7 2012, 9:59 AM

ForoaW wrote:

And the last months, new files have been added to the list. Is there a structural way to identify and to repair them ? Thumbnails and external references seem to remain intact.

Kaganer added a comment.Via ConduitOct 25 2012, 3:03 PM

Soon expire a year from the opening of the bug. Maybe even able to determine the cause of the problem?

Aklapper added a comment.Via ConduitOct 25 2012, 6:30 PM

The cause is likely explained in comment 2.

bzimport added a comment.Via ConduitOct 26 2012, 9:14 AM

ForoaW wrote:

It sems impossible to edit/categorise the "repaired files"
including http://commons.wikimedia.org/wiki/File:Two-stroke_engine_moving_parts_(Montagu,_Cars_and_Motor-Cycles,_1928).svg

Some of the remaining older ones: http://commons.wikimedia.org/wiki/File:POSTERMENDOZA.JPG
http://commons.wikimedia.org/wiki/File:Ogeret.JPG

What to do with the faulty files we detect further on ?

Platonides added a comment.Via ConduitOct 28 2012, 4:58 PM

Interesting, trying to create the page results in an edit conflict. This is new.

These three pages have a page entry, but no associated revision (and page_latest = 0).

There are 16 images like this:
mysql> select page_namespace,page_title from page where page_latest = 0;
+----------------+---------------------------------------------------------------------------+

page_namespacepage_title

+----------------+---------------------------------------------------------------------------+

6Bruxelles_Java_Masque_Wayang_02_10_2011_06.jpg
6POSTERMENDOZA.JPG
6Ogeret.JPG
6SANTA_MARIA_DE_PUIG-AGUILAR_-_7.JPG
6Luftaufnahmen_Nordseekueste_2012_05_D50_by-RaBoe_066.jpg
6Kit_body_brugge1314.png
6Kit_left_arm_union1314.png
6Kit_body_union1314.png
6Kit_right_arm_union1314.png
6Luftaufnahmen_Nordseekueste_2012_05_D50_by-RaBoe_067.jpg
6Two-stroke_engine_moving_parts_(Montagu,_Cars_and_Motor-Cycles,_1928).svg
6Pastel_Raymond_Martin_2.jpg
6Coxeter_diagram_finite_rank4_correspondence.png
6USS_Tempest_PC-2_Crest.png
6RUS-CZE_2012-06-08_pl.svg
6Mérite_national_chevalier_FRANCE.jpg

+----------------+---------------------------------------------------------------------------+

bzimport added a comment.Via ConduitApr 15 2013, 1:12 PM

ForoaW wrote:

Obviously a low priority item: 17 months now.

Jarekt added a comment.Via ConduitMay 24 2013, 11:55 AM

I deleted and reuploded Bruxelles_Java_Masque_Wayang_02_10_2011_06.jpg while trying to fix the impossible to edit file with no license. Should we do the same with the rest?

bzimport added a comment.Via ConduitMay 24 2013, 2:33 PM

ForoaW wrote:

Strange, as the picture is in the system and can be used on wikipedias, such as http://commons.wikimedia.org/wiki/File:POSTERMENDOZA.JPG .

There are several more in http://commons.wikimedia.org/wiki/Special:UnusedFiles and http://commons.wikimedia.org/wiki/Special:UncategorizedFiles.

Bawolff added a comment.Via ConduitFeb 19 2014, 1:55 AM

There are a significant number (~700 a month) of files where during upload an entry in page table is created, but page_latest = 0 (and log_page also seems to be 0). If we just count upload entries with log_page = 0 (Which seems to be symtomatic of the issue, but I haven't 100% verified it can't happen in other situations)

MariaDB [commonswiki_p]> select substr( log_timestamp, 1, 6) as YYYYMM, count(*) as 'uploads missing log_page' from logging_logindex where log_type='upload' and log_timestamp > '20121000000000' and log_page = 0 group by substr( log_timestamp, 1, 6);
+--------+--------------------------+

YYYYMMuploads missing log_page

+--------+--------------------------+

2012101323
201211572
201212624
2013011179
201302666
201303762
201304955
201305637
201306598
201307555
201308818
2013093268
201310808
201311852
201312806
2014011116
201402435

+--------+--------------------------+
17 rows in set (7.08 sec)

This is not a good thing.

hoo added a comment.Via ConduitFeb 23 2014, 1:12 PM
  • Bug 60205 has been marked as a duplicate of this bug. ***
hoo added a comment.Via ConduitFeb 23 2014, 1:44 PM

Increasing priority: This starts to cause noticeable problems on commons

bzimport added a comment.Via ConduitFeb 23 2014, 7:27 PM

raysonlogin wrote:

This bug also affected my file (CASCON2013-2.JPG on Commons) and I was trying to edit the corrupted page but it didn't save.

This bug only affected 1 of my uploads thus far (I've uploaded 1000+ files onto commons), but I voted the importance up as I think the bug can cause major issues.

Rayson

(P.S. This is Rayson from the Open Grid Scheduler / Grid Engine project: http://gridscheduler.sourceforge.net/ - besides free software, we also contribute to free media repos like Wiki Commons.)

Aklapper added a comment.Via ConduitFeb 23 2014, 9:13 PM

(In reply to Marius Hoch from comment #22)

Increasing priority: This starts to cause noticeable problems on commons

Aaron / Sean: Could somebody take a look at this, please?

Magog_the_Ogre added a comment.Via ConduitFeb 24 2014, 12:11 AM

Is the problem related to this? I see no history on the following files on Commons:

File:Admission ticket to Benjamin Rush lecture.jpg
File:Admission ticket to John Morgan lecture.jpg
File:Admission ticket to Caspar Wistar lecture.jpg
File:Papilio anchisiades, mating.jpg

hoo added a comment.Via ConduitFeb 24 2014, 12:29 AM

(In reply to magog.the.ogre from comment #25)

Is the problem related to this? I see no history on the following files on
Commons:

File:Admission ticket to Benjamin Rush lecture.jpg
File:Admission ticket to John Morgan lecture.jpg
File:Admission ticket to Caspar Wistar lecture.jpg
File:Papilio anchisiades, mating.jpg

All these files have (as far as I see) a valid history and a valid value in page_latest. They are just messed redirects... So no, that's not related.

hoo added a comment.Via ConduitFeb 28 2014, 7:54 PM
  • Bug 61898 has been marked as a duplicate of this bug. ***
hoo added a comment.Via ConduitFeb 28 2014, 7:58 PM

Raising priority again (per duplicate bug and ongoing issues)

Aklapper added a comment.Via ConduitFeb 28 2014, 10:05 PM

Note: Dup bug 61898 comment 4 has some investigations by Sean.

Bawolff added a comment.Via ConduitMar 1 2014, 12:50 AM

(In reply to Andre Klapper from comment #29)

Note: Dup bug 61898 comment 4 has some investigations by Sean.

Could you change its component to not be security so we can see? (assuming nothing sensitive is there. This hardly seems like a security issue, but obviously its hard to make that judgement without knowing whats on the bug)

greg added a comment.Via ConduitMar 3 2014, 3:52 PM

(In reply to Bawolff (Brian Wolff) from comment #30)

(In reply to Andre Klapper from comment #29)
> Note: Dup bug 61898 comment 4 has some investigations by Sean.

Could you change its component to not be security so we can see? (assuming
nothing sensitive is there. This hardly seems like a security issue, but
obviously its hard to make that judgement without knowing whats on the bug)

(Using Chatham House Rules ;) )

The reasoning for filing it under Security was given in bug 61898 comment 0:

"I'm filing this under Security to keep it on a low radar. They're innocent to keep around, but in my experience filing these publicly may cause them to disappear at some point when too many people spread the link without proper context or try to poke at it."

No comment on the validity of the reasoning, but there are URLs in that bug that are not on this one.

RobLa-WMF added a comment.Via ConduitMar 20 2014, 9:58 PM

Hi Gilles, could you insert this into the planning process for the next sprint?

gerritbot added a comment.Via ConduitMar 21 2014, 6:02 AM

Change 119932 had a related patch set uploaded by Brian Wolff:
Rollback transactions if job throws an exception.

https://gerrit.wikimedia.org/r/119932

Bawolff added a comment.Via ConduitMar 21 2014, 6:37 AM

I was finally able to reproduce this.

Steps to reproduce:
*Make sure $wgEnableAsyncUploads = true;
*Artificially make it so that Revision::insertOn throws some sort of exception. My current theory is that maybe very intermitent issue with external store dbs is triggering the "Unable to store text to external storage" exception in ExternalStore::insertWithFallback. That's just a random guess based on the proximity of one approximately the db inserts fail. Could be totally wrong. Anyway, its probably somewhere around that block of code. Someone with access to job queue log could confirm if there are any common exceptions in the log, and look at the results of the upload job for the various example files in question.
*Upload an image using the api and stashed upload, making sure that the async option is specified.
*Do runJobs.php --type PublishStashedFile to publish the file. This should return an error from the exception triggered via step 2

Actual behaviour:
*Locally (I don't know about commons), the upload api does correctly report a stash failure error to the user. Which is as it should be
*Image does not show up until page is action=purge or ~24hours, as the negative entry in file memcache is not cleared
*No RC entry (Although I did notice a gap in the rc_id field which is kind of weird, maybe coincidence)
*No dummy edit in page history
*Log entry is missing log_page
*Page table entry is inserted in DB, however page_latest is set to 0, which is a referential integrity violation and should never happen [The big bad of this bug]
*Page contents is missing, and replaced with missing revision error. Page cannot be edited (Get edit conflict warning). Only way to fix page is to re-upload a new image over it or delete and undelete the page.

Expected behaviour:
*Well you know, upload just works properly ;)
*Referential integrity should never ever be violated. There should never be an uneditable page.
*Obviously a better behaviour would be for the page to be blank instead of "broken", however that's probably still a "bad" behaviour. Ideally we would want the users text to always be on the page. Its unclear to me if we would prefer that the image totally go away if we can't edit the page properly (ie the operation totally be atomic), or if the current behaviour of the file being saved is preferable (probably. Less data loss the better).
*Cache should be cleared so that images don't randomly appear 24 hours after the fact.

(In reply to Gerrit Notification Bot from comment #33)

Change 119932 had a related patch set uploaded by Brian Wolff:
Rollback transactions if job throws an exception.

https://gerrit.wikimedia.org/r/119932

If I'm correct about the cause (Which as it stands is just a guess), and the patch actually works, what it will cause is that pages experiencing this bug won't be in a "broken" state (ref integrity failure), but just be blank, so users can create them later as they please. This is not a full fix to the issue, we still need to figure out what situation is causing the issue, and reduce its occurrence (Could someone look at job queue logs for one of the example files like File:ContentLanguage.svg if logs go back that far, and see what the last error was?). We also may want to try to catch whatever exception is happening, and retry the operation, if we're assuming its a very intermittent issue.

The reasoning for filing it under Security was given in bug 61898 comment 0:

"I'm filing this under Security to keep it on a low radar. They're innocent
to keep around, but in my experience filing these publicly may cause them to
disappear at some point when too many people spread the link without proper
context or try to poke at it."

No comment on the validity of the reasoning, but there are URLs in that bug
that are not on this one.

That seems unnecessary given how often we're getting new examples - Most recently [[commons:File:ContentLanguage.svg]] which was uploaded a week ago (Earlier we seemed to be getting new examples every day). Commons admins also have db access via tool labs and are more than capable of getting a list of all affected files themselves if they were so inclined.

Springle added a comment.Via ConduitMar 25 2014, 12:22 PM

Possible that bug 63058 is related. Unproven, but that one was certainly blocking some commits on commonswiki today.

Bawolff added a comment.Via ConduitMar 25 2014, 3:02 PM

(In reply to Sean Pringle from comment #35)

Possible that bug 63058 is related. Unproven, but that one was certainly
blocking some commits on commonswiki today.

Blocking the commit wouldnt cause this since (from what i understand) that would cause a rollback once php timed out. This bug needs to have a COMMIT issued in the middle of a transaction to cause an inconsistent state.

Bawolff added a comment.Via ConduitApr 5 2014, 9:12 PM
  • Bug 40178 has been marked as a duplicate of this bug. ***
gerritbot added a comment.Via ConduitApr 6 2014, 1:18 AM

Change 124135 had a related patch set uploaded by Brian Wolff:
Make chunked upload jobs robust in face of exceptions.

https://gerrit.wikimedia.org/r/124135

gerritbot added a comment.Via ConduitApr 6 2014, 1:53 AM

Change 124136 had a related patch set uploaded by Brian Wolff:
When uploading a new file, save to memcached directly after commit

https://gerrit.wikimedia.org/r/124136

gerritbot added a comment.Via ConduitApr 6 2014, 2:22 AM

Change 124137 had a related patch set uploaded by Brian Wolff:
Make doEditContent call $dbw->rollback() if exception happens

https://gerrit.wikimedia.org/r/124137

Bawolff added a comment.Via ConduitApr 6 2014, 3:50 AM

So the patch in comment 38 or the patch in comment 40 will stop the "Missing revision #0 error" on new files. Instead when that situation happens, it will cause the page to just not be created, which is an improvement over the page becoming corrupt. (Either patch will fix the issue. Both together make things even more robust)

The patch in comment 39 would make it so that when this issue happens, the image appears immediately instead of 24 hours later due to stale negative memcache entry.

So what remains to be done for this bug (Other then review of those patches):
*Figure out root cause of edits not going through
**Most recent example (for a file) is [[commons:File:Stephens_college_at_the_corner_of_Waugh_and_Broadway_.jpg]] uploaded at approximately 2014-03-30 00:24:12. It would be helpful if someone who had access to job queue logs could grep through for a PublishStashedFile job at about that time, for that title, and see what the jobs error output is.

*Figure out if there is something that should be done about root cause (which really depends on what it is)

*Figure out if it makes sense in the face of an edit failure to simply try and doEditContent() a second time(?) Or if there is some other behaviour that makes sense to do in the case where we try to edit a page and it doesn't work.

gerritbot added a comment.Via ConduitApr 6 2014, 9:12 AM

Change 124137 merged by jenkins-bot:
Make doEditContent call $dbw->rollback() if exception happens

https://gerrit.wikimedia.org/r/124137

gerritbot added a comment.Via ConduitApr 6 2014, 9:26 AM

Change 124136 merged by jenkins-bot:
When uploading a new file, save to memcached directly after commit

https://gerrit.wikimedia.org/r/124136

gerritbot added a comment.Via ConduitApr 6 2014, 9:26 AM

Change 124135 merged by jenkins-bot:
Make chunked upload jobs robust in face of exceptions.

https://gerrit.wikimedia.org/r/124135

Bawolff added a comment.Via ConduitApr 9 2014, 1:46 AM

Resetting for new for now, until we confirm whether this is fixed or not.

Aaron tried to look up [[commons:File:Rancho_Camulos_National_Historic_Landmark_Plaque.jpg]] in the jobqueue log for me, but it wasn't in the log. Which first of all confused me, and then caused me to look at UploadWizard code, and realize it only uploads via job queue if file is > 10 MB. So I guess a lot of the errors weren't coming from exceptions thrown by the upload job like I previously thought, but perhaps just directly through the api (Although perhaps a small number came through job classes).

Which means that anomie's patch dc7d342d93b12 (which was deployed to commons today, april 8 at 21:12), may actually fix this issue. The patch in comment 42 (not yet deployed) may also equally fix the issue. Anyways, guess we should watch to see if these files stop occurring. The most recent example I can find is from today at 17:20 ( [[commons:file:ABS-3401.0-OverseasArrivalsDeparturesAustralia-ShorttermMovementResidentDepartures_SelectedDestinations-Trend-NumberMovements-Cambodia-A83808853V.svg]] ), so, so far so good. Not that that means much yet as in the past there have been gaps of a week between such files sometimes.


There's also the separate issue of what's causing the underlying error. There is still going to be files without a description page, they just wont have the "revision 0 is missing" error any more. We should figure out what's up with that. Perhaps that should be split to a separate bug (?)

Aklapper added a comment.Via ConduitMay 5 2014, 4:23 PM

Platonides, Paul, Yann, Trijnstel: Do you know if this has been still a problem in the last three weeks?

Steinsplitter added a comment.Via ConduitMay 6 2014, 4:08 PM

(In reply to Andre Klapper from comment #46)

Platonides, Paul, Yann, Trijnstel: Do you know if this has been still a
problem in the last three weeks?

Unfortunately yes.

Bawolff added a comment.Via ConduitMay 14 2014, 10:58 PM

(In reply to Steinsplitter from comment #47)

(In reply to Andre Klapper from comment #46)
> Platonides, Paul, Yann, Trijnstel: Do you know if this has been still a
> problem in the last three weeks?

Unfortunately yes.

I haven't been able to find any recent examples in the db. Has there been any new files with this issue in say the last month (or even in the later part of april). If so please include examples.

Aklapper added a comment.Via ConduitMay 25 2014, 3:14 PM

(In reply to Bawolff (Brian Wolff) from comment #48)

I haven't been able to find any recent examples in the db. Has there been
any new files with this issue in say the last month (or even in the later
part of april). If so please include examples.

Replies welcome.

Aklapper added a comment.Via ConduitJun 5 2014, 2:45 PM

Reducing priority until somebody can provide recent examples.
See comment 48, comment 49.

Bawolff added a comment.Via ConduitJun 5 2014, 2:56 PM

Calling fixed. I think comment 44 + Anomie's api patch fixed the issue. If anyone encounters more examples please speak up.

Bawolff added a comment.Via ConduitJun 9 2014, 2:40 AM

btw, splitting off larger issue of sometimes no description gets posted, to bug 66355.

Add Comment

Column Prototype
This is a very early prototype of a persistent column. It is not expected to work yet, and leaving it open will activate other new features which will break things. Press "\" (backslash) on your keyboard to close it now.