Page MenuHomePhabricator

Descriptionless files (Missing page_latest referential integrity issue)
Closed, ResolvedPublic

Description

A number of files appeared on commons on 15th with no description (may or may not have page entry)
http://commons.wikimedia.org/w/index.php?title=Commons:Administrators%27_noticeboard&oldid=62820408#System_problems

select img_name, img_timestamp, rev_timestamp, page_latest, rev_timestamp - img_timestamp from image left join page on (page_title=img_name and page_namespace=6) left join revision on (reV_page = page_id and rev_parent_id = 0) where img_timestamp >= '20111114233531' limit 7;

+------------------------------------------------+----------------+----------------+-------------+-------------------------------+

img_nameimg_timestamprev_timestamppage_latestrev_timestamp - img_timestamp

+------------------------------------------------+----------------+----------------+-------------+-------------------------------+

La_playAaAa.png2011111423364220111120142847627659185909205(description created later)
Bruxelles_Java_Masque_Wayang_02_10_2011_06.jpg20111114233723NULL0NULL
Raonic_and_Youzhny.jpg20111114233732NULLNULLNULL
Darcy's_gravesite_in_SE_Dijon_(France).jpg2011111423391120111120142313627656285908402(description created later)
Defeater-lost-ground-album-cover.jpg20111114233944NULL0NULL
Jaguar_Sentado.jpg20111114234008NULLNULLNULL
Phil_Bryant.jpg20111114234008NULLNULLNULL

+------------------------------------------------+----------------+----------------+-------------+-------------------------------+

The timing matches with the update of external storage servers https://blog.wikimedia.org/2011/11/18/nobody-notices-when-its-not-broken-new-database-servers-deployed/

I think this is due to the image upload nested transactions, which break the outer transactions. See http://thread.gmane.org/gmane.science.linguistics.wikipedia.technical/49500

When the servers were switched, the second half transaction for those files would have failed and half-rollbacked.

We already suffered this kind of failures in the past, see bug 15430, bug 20744, bug 24978. Seems that each ~12 months we suffer one of those. It should finally be fixed.


Version: 1.23.0
Severity: critical
URL: http://commons.wikimedia.org/w/index.php?title=Commons:Administrators%27_noticeboard&oldid=62820408#System_problems
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=61898
https://bugzilla.wikimedia.org/show_bug.cgi?id=15430

Details

Reference
bz32551

Event Timeline

bzimport raised the priority of this task from to High.Nov 22 2014, 12:04 AM
bzimport set Reference to bz32551.

http://wikitech.wikimedia.org/view/Server_admin_log#November_14 and you can see the times match exactly.

I guess we could disable uploads during such maintenance.

Was that the External Storage switchover with the read-only period?

Best is to make sure that doesn't happen by preparing the maintenance so there's still a read/write cluster available to accept saves.

In the case that we get stuck with it anyway, sounds like there's some bad error handling somewhere -- ideally the upload should end up getting rejected and rolled back if part of the file: description page edit fails. Oooh yeah nested transactions, not sure that works.

Look at the wikitech thread. A second begin() does an implicit commit (with some backends, including mysql) so it's quite untransactional. We start a transaction, the file classes use the db for lock, then doEditPage() has a begin() commit() too...

Any pgogress for resolve this bug?

(In reply to comment #3)

Look at the wikitech thread. A second begin() does an implicit commit (with
some backends, including mysql) so it's quite untransactional. We start a
transaction, the file classes use the db for lock, then doEditPage() has a
begin() commit() too...

See bug #27283 for this one

Any pgogress for resolve this bug? Please fid & fix all from descriptionless files (with publishing list of these files) by creating temporary file info pages - for further manual filling.

See also other side of these same problem: file with first version is non exist. https://bugzilla.wikimedia.org/show_bug.cgi?id=34934

Any pgogress for resolve this bug?

ForoaW wrote:

And the last months, new files have been added to the list. Is there a structural way to identify and to repair them ? Thumbnails and external references seem to remain intact.

Soon expire a year from the opening of the bug. Maybe even able to determine the cause of the problem?

The cause is likely explained in comment 2.

Interesting, trying to create the page results in an edit conflict. This is new.

These three pages have a page entry, but no associated revision (and page_latest = 0).

There are 16 images like this:
mysql> select page_namespace,page_title from page where page_latest = 0;
+----------------+---------------------------------------------------------------------------+

page_namespacepage_title

+----------------+---------------------------------------------------------------------------+

6Bruxelles_Java_Masque_Wayang_02_10_2011_06.jpg
6POSTERMENDOZA.JPG
6Ogeret.JPG
6SANTA_MARIA_DE_PUIG-AGUILAR_-_7.JPG
6Luftaufnahmen_Nordseekueste_2012_05_D50_by-RaBoe_066.jpg
6Kit_body_brugge1314.png
6Kit_left_arm_union1314.png
6Kit_body_union1314.png
6Kit_right_arm_union1314.png
6Luftaufnahmen_Nordseekueste_2012_05_D50_by-RaBoe_067.jpg
6Two-stroke_engine_moving_parts_(Montagu,_Cars_and_Motor-Cycles,_1928).svg
6Pastel_Raymond_Martin_2.jpg
6Coxeter_diagram_finite_rank4_correspondence.png
6USS_Tempest_PC-2_Crest.png
6RUS-CZE_2012-06-08_pl.svg
6Mérite_national_chevalier_FRANCE.jpg

+----------------+---------------------------------------------------------------------------+

ForoaW wrote:

Obviously a low priority item: 17 months now.

I deleted and reuploded Bruxelles_Java_Masque_Wayang_02_10_2011_06.jpg while trying to fix the impossible to edit file with no license. Should we do the same with the rest?

There are a significant number (~700 a month) of files where during upload an entry in page table is created, but page_latest = 0 (and log_page also seems to be 0). If we just count upload entries with log_page = 0 (Which seems to be symtomatic of the issue, but I haven't 100% verified it can't happen in other situations)

MariaDB [commonswiki_p]> select substr( log_timestamp, 1, 6) as YYYYMM, count(*) as 'uploads missing log_page' from logging_logindex where log_type='upload' and log_timestamp > '20121000000000' and log_page = 0 group by substr( log_timestamp, 1, 6);
+--------+--------------------------+

YYYYMMuploads missing log_page

+--------+--------------------------+

2012101323
201211572
201212624
2013011179
201302666
201303762
201304955
201305637
201306598
201307555
201308818
2013093268
201310808
201311852
201312806
2014011116
201402435

+--------+--------------------------+
17 rows in set (7.08 sec)

This is not a good thing.

  • Bug 60205 has been marked as a duplicate of this bug. ***

Increasing priority: This starts to cause noticeable problems on commons

raysonlogin wrote:

This bug also affected my file (CASCON2013-2.JPG on Commons) and I was trying to edit the corrupted page but it didn't save.

This bug only affected 1 of my uploads thus far (I've uploaded 1000+ files onto commons), but I voted the importance up as I think the bug can cause major issues.

Rayson

(P.S. This is Rayson from the Open Grid Scheduler / Grid Engine project: http://gridscheduler.sourceforge.net/ - besides free software, we also contribute to free media repos like Wiki Commons.)

(In reply to Marius Hoch from comment #22)

Increasing priority: This starts to cause noticeable problems on commons

Aaron / Sean: Could somebody take a look at this, please?

Is the problem related to this? I see no history on the following files on Commons:

File:Admission ticket to Benjamin Rush lecture.jpg
File:Admission ticket to John Morgan lecture.jpg
File:Admission ticket to Caspar Wistar lecture.jpg
File:Papilio anchisiades, mating.jpg

(In reply to magog.the.ogre from comment #25)

Is the problem related to this? I see no history on the following files on
Commons:

File:Admission ticket to Benjamin Rush lecture.jpg
File:Admission ticket to John Morgan lecture.jpg
File:Admission ticket to Caspar Wistar lecture.jpg
File:Papilio anchisiades, mating.jpg

All these files have (as far as I see) a valid history and a valid value in page_latest. They are just messed redirects... So no, that's not related.

  • Bug 61898 has been marked as a duplicate of this bug. ***

Raising priority again (per duplicate bug and ongoing issues)

Note: Dup bug 61898 comment 4 has some investigations by Sean.

(In reply to Andre Klapper from comment #29)

Note: Dup bug 61898 comment 4 has some investigations by Sean.

Could you change its component to not be security so we can see? (assuming nothing sensitive is there. This hardly seems like a security issue, but obviously its hard to make that judgement without knowing whats on the bug)

(In reply to Bawolff (Brian Wolff) from comment #30)

(In reply to Andre Klapper from comment #29)

Note: Dup bug 61898 comment 4 has some investigations by Sean.

Could you change its component to not be security so we can see? (assuming
nothing sensitive is there. This hardly seems like a security issue, but
obviously its hard to make that judgement without knowing whats on the bug)

(Using Chatham House Rules ;) )

The reasoning for filing it under Security was given in bug 61898 comment 0:

"I'm filing this under Security to keep it on a low radar. They're innocent to keep around, but in my experience filing these publicly may cause them to disappear at some point when too many people spread the link without proper context or try to poke at it."

No comment on the validity of the reasoning, but there are URLs in that bug that are not on this one.

Hi Gilles, could you insert this into the planning process for the next sprint?

Change 119932 had a related patch set uploaded by Brian Wolff:
Rollback transactions if job throws an exception.

https://gerrit.wikimedia.org/r/119932

I was finally able to reproduce this.

Steps to reproduce:
*Make sure $wgEnableAsyncUploads = true;
*Artificially make it so that Revision::insertOn throws some sort of exception. My current theory is that maybe very intermitent issue with external store dbs is triggering the "Unable to store text to external storage" exception in ExternalStore::insertWithFallback. That's just a random guess based on the proximity of one approximately the db inserts fail. Could be totally wrong. Anyway, its probably somewhere around that block of code. Someone with access to job queue log could confirm if there are any common exceptions in the log, and look at the results of the upload job for the various example files in question.
*Upload an image using the api and stashed upload, making sure that the async option is specified.
*Do runJobs.php --type PublishStashedFile to publish the file. This should return an error from the exception triggered via step 2

Actual behaviour:
*Locally (I don't know about commons), the upload api does correctly report a stash failure error to the user. Which is as it should be
*Image does not show up until page is action=purge or ~24hours, as the negative entry in file memcache is not cleared
*No RC entry (Although I did notice a gap in the rc_id field which is kind of weird, maybe coincidence)
*No dummy edit in page history
*Log entry is missing log_page
*Page table entry is inserted in DB, however page_latest is set to 0, which is a referential integrity violation and should never happen [The big bad of this bug]
*Page contents is missing, and replaced with missing revision error. Page cannot be edited (Get edit conflict warning). Only way to fix page is to re-upload a new image over it or delete and undelete the page.

Expected behaviour:
*Well you know, upload just works properly ;)
*Referential integrity should never ever be violated. There should never be an uneditable page.
*Obviously a better behaviour would be for the page to be blank instead of "broken", however that's probably still a "bad" behaviour. Ideally we would want the users text to always be on the page. Its unclear to me if we would prefer that the image totally go away if we can't edit the page properly (ie the operation totally be atomic), or if the current behaviour of the file being saved is preferable (probably. Less data loss the better).
*Cache should be cleared so that images don't randomly appear 24 hours after the fact.

(In reply to Gerrit Notification Bot from comment #33)

Change 119932 had a related patch set uploaded by Brian Wolff:
Rollback transactions if job throws an exception.

https://gerrit.wikimedia.org/r/119932

If I'm correct about the cause (Which as it stands is just a guess), and the patch actually works, what it will cause is that pages experiencing this bug won't be in a "broken" state (ref integrity failure), but just be blank, so users can create them later as they please. This is not a full fix to the issue, we still need to figure out what situation is causing the issue, and reduce its occurrence (Could someone look at job queue logs for one of the example files like File:ContentLanguage.svg if logs go back that far, and see what the last error was?). We also may want to try to catch whatever exception is happening, and retry the operation, if we're assuming its a very intermittent issue.

The reasoning for filing it under Security was given in bug 61898 comment 0:

"I'm filing this under Security to keep it on a low radar. They're innocent
to keep around, but in my experience filing these publicly may cause them to
disappear at some point when too many people spread the link without proper
context or try to poke at it."

No comment on the validity of the reasoning, but there are URLs in that bug
that are not on this one.

That seems unnecessary given how often we're getting new examples - Most recently [[commons:File:ContentLanguage.svg]] which was uploaded a week ago (Earlier we seemed to be getting new examples every day). Commons admins also have db access via tool labs and are more than capable of getting a list of all affected files themselves if they were so inclined.

Possible that bug 63058 is related. Unproven, but that one was certainly blocking some commits on commonswiki today.

(In reply to Sean Pringle from comment #35)

Possible that bug 63058 is related. Unproven, but that one was certainly
blocking some commits on commonswiki today.

Blocking the commit wouldnt cause this since (from what i understand) that would cause a rollback once php timed out. This bug needs to have a COMMIT issued in the middle of a transaction to cause an inconsistent state.

  • Bug 40178 has been marked as a duplicate of this bug. ***

Change 124135 had a related patch set uploaded by Brian Wolff:
Make chunked upload jobs robust in face of exceptions.

https://gerrit.wikimedia.org/r/124135

Change 124136 had a related patch set uploaded by Brian Wolff:
When uploading a new file, save to memcached directly after commit

https://gerrit.wikimedia.org/r/124136

Change 124137 had a related patch set uploaded by Brian Wolff:
Make doEditContent call $dbw->rollback() if exception happens

https://gerrit.wikimedia.org/r/124137

So the patch in comment 38 or the patch in comment 40 will stop the "Missing revision #0 error" on new files. Instead when that situation happens, it will cause the page to just not be created, which is an improvement over the page becoming corrupt. (Either patch will fix the issue. Both together make things even more robust)

The patch in comment 39 would make it so that when this issue happens, the image appears immediately instead of 24 hours later due to stale negative memcache entry.

So what remains to be done for this bug (Other then review of those patches):
*Figure out root cause of edits not going through
**Most recent example (for a file) is [[commons:File:Stephens_college_at_the_corner_of_Waugh_and_Broadway_.jpg]] uploaded at approximately 2014-03-30 00:24:12. It would be helpful if someone who had access to job queue logs could grep through for a PublishStashedFile job at about that time, for that title, and see what the jobs error output is.

*Figure out if there is something that should be done about root cause (which really depends on what it is)

*Figure out if it makes sense in the face of an edit failure to simply try and doEditContent() a second time(?) Or if there is some other behaviour that makes sense to do in the case where we try to edit a page and it doesn't work.

Change 124137 merged by jenkins-bot:
Make doEditContent call $dbw->rollback() if exception happens

https://gerrit.wikimedia.org/r/124137

Change 124136 merged by jenkins-bot:
When uploading a new file, save to memcached directly after commit

https://gerrit.wikimedia.org/r/124136

Change 124135 merged by jenkins-bot:
Make chunked upload jobs robust in face of exceptions.

https://gerrit.wikimedia.org/r/124135

Resetting for new for now, until we confirm whether this is fixed or not.

Aaron tried to look up [[commons:File:Rancho_Camulos_National_Historic_Landmark_Plaque.jpg]] in the jobqueue log for me, but it wasn't in the log. Which first of all confused me, and then caused me to look at UploadWizard code, and realize it only uploads via job queue if file is > 10 MB. So I guess a lot of the errors weren't coming from exceptions thrown by the upload job like I previously thought, but perhaps just directly through the api (Although perhaps a small number came through job classes).

Which means that anomie's patch dc7d342d93b12 (which was deployed to commons today, april 8 at 21:12), may actually fix this issue. The patch in comment 42 (not yet deployed) may also equally fix the issue. Anyways, guess we should watch to see if these files stop occurring. The most recent example I can find is from today at 17:20 ( [[commons:file:ABS-3401.0-OverseasArrivalsDeparturesAustralia-ShorttermMovementResidentDepartures_SelectedDestinations-Trend-NumberMovements-Cambodia-A83808853V.svg]] ), so, so far so good. Not that that means much yet as in the past there have been gaps of a week between such files sometimes.


There's also the separate issue of what's causing the underlying error. There is still going to be files without a description page, they just wont have the "revision 0 is missing" error any more. We should figure out what's up with that. Perhaps that should be split to a separate bug (?)

Platonides, Paul, Yann, Trijnstel: Do you know if this has been still a problem in the last three weeks?

(In reply to Andre Klapper from comment #46)

Platonides, Paul, Yann, Trijnstel: Do you know if this has been still a
problem in the last three weeks?

Unfortunately yes.

(In reply to Steinsplitter from comment #47)

(In reply to Andre Klapper from comment #46)

Platonides, Paul, Yann, Trijnstel: Do you know if this has been still a
problem in the last three weeks?

Unfortunately yes.

I haven't been able to find any recent examples in the db. Has there been any new files with this issue in say the last month (or even in the later part of april). If so please include examples.

(In reply to Bawolff (Brian Wolff) from comment #48)

I haven't been able to find any recent examples in the db. Has there been
any new files with this issue in say the last month (or even in the later
part of april). If so please include examples.

Replies welcome.

Reducing priority until somebody can provide recent examples.
See comment 48, comment 49.

Calling fixed. I think comment 44 + Anomie's api patch fixed the issue. If anyone encounters more examples please speak up.

btw, splitting off larger issue of sometimes no description gets posted, to bug 66355.