Page MenuHomePhabricator

[Regression] File deletions don't always purge everything
Closed, ResolvedPublic

Description

On Commons, nowadays we sometimes notice that after a file deletion via API, the file-revision-table remains visible and links to the files are not rendered as redlinks. Purging helps but if you delete, you do not expect having to go to the page and purge it.

I will try to get some screenshots.


Version: unspecified
Severity: critical

Details

Reference
bz35047

Event Timeline

bzimport raised the priority of this task from to Unbreak Now!.Nov 22 2014, 12:17 AM
bzimport set Reference to bz35047.

Not just the API.

For the past hour I've been working through a backlog of copyright violations on Commons based on this category:

https://commons.wikimedia.org/wiki/Category:Copyright_violations

All 3 files I deleted in the last 5 minutes however are different, something weird happened (or rather, didn't happen). The category member links were kept and after refreshing they're still showing up in the category (although the thumbnail is broken, and opening the link shows a "This page was recently deleted" message).

[[commons:File:Re7.png]], [[commons:File:Re5.png]], [[commons:File:Re3.png]]

Looks like after purging the category page and file pages, they were removed from the category.

It shouldn't be possible for a non-existent page to be categorized :D

The POST request submitting action=delete sometimes takes about to 2-3 seconds to respond.

For me it sometimes takes up to 10s but I am on the "wrong server" (secure.wikimedia.org)

wiki wrote:

Until around the middle of February, I could use DelReqHandler to delete all of the files in a Mass DR in about ten seconds -- hit the {del} link on each file, then hit Enter to accept the default comment, and so on. They would all go red very quickly.

I am currently working on
http://commons.wikimedia.org/wiki/Commons:Deletion_requests/Files_in_Category:Commemorative_euro_coins_%28France%29#Files_in_Category:Commemorative_euro_coins_.28France.29

After two minutes, the first five files are still highlighted in gray with "working". The last six are all red. If recent experience is true here, they will remain "working" indefinitely -- I once left them overnight without change.

If I then reload the DR page, several things can happen.

  1. In this case, they all show up red. This is true less than half the time
  2. In most cases some or all will will remain blue.

2a) Then if I go to the file page, in some cases the file will be intact.
2b) In other cases, the file will appear intact, but the pulldown will start with "Undelete", showing that the file is actually gone. In that case, using "Purge" on the pulldown will turn it red on the DR page.
2c) in still other cases, the file page will turn up with "That file has been deleted..."

Since there are more than 100 DRs on an average day (more than 200 on today's targets), this is seriously slowing down closures.

Jim

Jameslwoodward

Are any 500 type errors or timeouts occurring?

(In reply to comment #6)

Are any 500 type errors or timeouts occurring?

Not that I know of.

See also 35054

wiki wrote:

Reply to Comment #6:

Occasionally, perhaps one in twenty deletes:

"API request returned code 504 errorError code is Gateway Time-out"

Sometimes just one, sometimes multiple times while doing a multiple delete.

Sometimes the process completes correctly after a 504 error; sometimes some of them hang. Five of 13 deletes are hung now at Commons:Deletion requests/The Boy in the Plastic Bubble (film). The other eight are now red-links.

It happens more often when more than three or four deletes are running at once.

Jim

wiki wrote:

Yes.
I just deleted

File:Alter Way 50 x 109.jpg
File:Reciprok logo 50 x 109.jpg
File:AWF 50 x 109.jpg
File:AWS LOGO 50 x 109.jpg [keep] [del]
File:AWH 50x109.jpg [keep] [del]

After a page reload, the first three are red, the last two are blue, although they have both been deleted.

Clicking on

File:AWS LOGO 50 x 109.jpg [keep] [del]

yields: "No file by this name exists, but you can upload it."

while

File:AWH 50 x 109.jpg [keep] [del]

yields a page that shows the thumbnail and asks "View or restore 4 deleted edits?"

On

File:Badingchad.jpg [keep] [del]
File:Jamesmadara.jpg [keep] [del]
File:Lovetriangle.jpg [keep] [del]
File:Gakuranboys.jpg [keep] [del]

the first one is still blue, but actually deleted. The other three were not deleted.

Jim

How does the DelReqHandler script check the return status of the deletions. I assume it uses the API. Did it get a bad response for the files that where not deleted? Would it properly detect if it just got an error message back or treat it as success?

In any case there is definitely a cache invalidation problem going on. It could be a missing purge call, an HTMLCacheUpdate job bug, or purge request packet loss.

I wish I had a link or screenshot to "File:AWH 50 x 109.jpg" since that one sounds the strangest. Does it also show the file history?

wiki wrote:

I'm a user, not the programmer, so I can't answer technical questions, but, yes, I'm pretty sure it uses the API.

It's a transitory phenomena, so a link won't help, but the next time I see a page like File:AWH 50 x 109.jpg I'll shoot a screen shot and e-mail it to you.


It occurs to me -- and I'm guessing above my pay grade here -- that the servers appear to have generally slowed down over the last six weeks or so. Is it possible that this is a stack or queue getting overloaded, that would not have been overloaded when the servers were moving much faster in December, so that hitting a bunch of quick (del) would have been cleared before the overload occurred? Certainly even a single (del) takes much longer now than it did then.

Jim

(In reply to comment #11)

Would it properly detect if it just got an error message back or treat

it as success?
The the former one.

https://secure.wikimedia.org/wikipedia/commons/wiki/MediaWiki:Gadget-DelReqHandler.js

Deletion
deleteFile -> doAPICall -> $.ajax -> error/success (would throw an alert-box on error) (both HTTP-status errors and result.error (like bad token) are handled)

I just noticed a similar problem, using the regular interface and while deleting just one file [1]. The deletion took considerable time and delivered following error message:


Database error
Jump to: navigation, search
A database query syntax error has occurred. This may indicate a bug in the software. The last attempted database query was:

(SQL query hidden)

from within function "LocalFile::delete". Database returned error "1205: Lock wait timeout exceeded; try restarting transaction (10.0.6.41)".

Like in the case above, the links to that file remained blue and the
at the file description page, the thumbnail along with the file history
and the metadata were still to be seen. Purging helped but took also
considerable time.

Andreas.

[1] http://commons.wikimedia.org/wiki/File:Nude_man_standing_with_an_erection..jpg

The browser version used to view the category page after deletion may be relevant. Anyone want to say what browser version they were using?

wiki wrote:

Firefox 10.0.1 running on XP Pro

Jim

There are a few different issues here.

(In reply to comment #0)

On Commons, nowadays we sometimes notice that after a file deletion via API,
the file-revision-table remains visible and links to the files are not rendered
as redlinks. Purging helps but if you delete, you do not expect having to go to
the page and purge it.

The slave DB is used to check page_touched in response to an IMS request, so if the file description page looks identical to how it looked before delete, then slave lag coupled with session corruption is a possibility.

If the description page had no description text, but the "file history" section was still present, then that would point to an incomplete delete operation. In my test install for a file deletion, I count four separate transactions with write queries in them. The deletion of the image row happens after the deletion of the page row, so it's possible for the page to be deleted before the image, and it's possible for this situation to be user-visible.

If everything seemed fine after a few minutes then it was probably just slow rather than broken.

(In reply to comment #1)

All 3 files I deleted in the last 5 minutes however are different, something
weird happened (or rather, didn't happen). The category member links were kept
and after refreshing they're still showing up in the category (although the
thumbnail is broken, and opening the link shows a "This page was recently
deleted" message).

When an image is deleted, the page_touched for the categories it is in are not updated. It's probably been that way since categories were invented, I confirmed it back to 1.17. The code to do this purge is just not there. This is not related to the other reports.

(In reply to comment #5)

After two minutes, the first five files are still highlighted in gray with
"working". The last six are all red. If recent experience is true here, they
will remain "working" indefinitely -- I once left them overnight without
change.

Maybe an unexpected result from api.php caused the gadget to throw an exception. In any case, failure to show an informative error message is a gadget problem, not a server problem.

If I then reload the DR page, several things can happen.

  1. In this case, they all show up red. This is true less than half the time
  2. In most cases some or all will will remain blue.

There's a few possible causes of that.

2a) Then if I go to the file page, in some cases the file will be intact.
2b) In other cases, the file will appear intact, but the pulldown will start
with "Undelete", showing that the file is actually gone. In that case, using
"Purge" on the pulldown will turn it red on the DR page.

The deleted status used by the skin comes from the slave DB, so this could be either slave lag or an incomplete delete operation.

(In reply to comment #14)

from within function "LocalFile::delete". Database returned error "1205: Lock
wait timeout exceeded; try restarting transaction (10.0.6.41)".

This is the site_stats update, an exception thrown here will cause the memcached file cache update and LocalFile::purgeThumbList() to be skipped. Those two things are done in the same transaction, and LocalFile::purgeThumbList() is quite slow. But I'm not sure if it will explain all of the problems on this bug report. I'll look at dberror.log, it will tell us whether this lock wait timeout is common.

LocalFile::delete DB error counts per day:

 23 in Feb 1
  7 in Feb 3
  1 in Feb 5
  3 in Feb 6
  6 in Feb 9
 26 in Feb 10
  1 in Feb 11
  2 in Feb 12
 91 in Feb 13
  7 in Feb 14
  5 in Feb 15
  7 in Feb 16
  1 in Feb 17
  4 in Feb 19
 96 in Feb 20
 27 in Feb 21
  9 in Feb 22
 48 in Feb 23
 90 in Feb 24
 72 in Feb 25
180 in Feb 26
 13 in Feb 27
 59 in Feb 28
 28 in Feb 29
 29 in Mar 1
 33 in Mar 2
 30 in Mar 3
  2 in Mar 4
 13 in Mar 6
  3 in Mar 7
  5 in Mar 8
 10 in Mar 9
  3 in Mar 10
  4 in Mar 11
  9 in Mar 12
  8 in Mar 13
  4 in Mar 14
 56 in Mar 15
 23 in Mar 16
 38 in Mar 17
 50 in Mar 18
216 in Mar 19
 49 in Mar 20
  2 in Mar 21
  5 in Mar 22

Note the low counts around March 8, that implies it's not responsible for the original report.

The logs show lots of lock wait timeouts for other things that write to site_stats, but LocalFile::delete() seems like the most likely root cause. I patched it in r114442, we'll see if that fixes that part of this bug report at least.

wiki wrote:

It is far too soon to declare victory, but the problem may have gone away.

In case that it is not correct, two notes:

I would be careful of diagnosing from the counts. There are only half a dozen Admins who close most of the DRs. I know that I have cleared many fewer -- hundreds, perhaps as many as several thousand -- DRs than I would have if this bug were not slowing me down. I would not be surprised if my colleagues have also slowed down or stopped doing DRs. That is certainly evident from the backlog.

Note, again, that this a new problem in the last six weeks or eight weeks. I would look for some change around February 1 as the cause.

Jim

wiki wrote:

I have deleted about five hundred files using DelReqHandler this morning with no sign of the trouble detailed above.

Thank you very much for fixing this significant problem.

Jim

The logs show that total lock wait timeouts have dropped to a small fraction of what they were before r114442.