Shared repositories support for Special:WantedFiles
Open, Public

Assigned To
None
Priority
Normal
Author
bzimport
Subscribers
Liuxinyu970226, gpaumier, Nemo_bis and 8 others
Projects
Tokens
"Like" token, awarded by Nemo_bis.
Reference
bz6220
Description

Author: Eugene.Zelenko

Description:
It would be great to have the ability to list all missing files (on both the local wiki and
Commons). It could be used for fixing pages that reference such files.

In any case (if I understand correctly), a list of all images is already constructed for
Special:Mostimages, so only a check for file existence needs to be added.


Version: unspecified
Severity: normal

bzimport added a project: MediaWiki-Special-pages. Via Conduit, Nov 21 2014, 9:16 PM
bzimport added a subscriber: wikibugs-l.
bzimport set Reference to bz6220.
bzimport created this task. Via Legacy, Jun 6 2006, 1:28 PM
bzimport added a comment. Via Conduit, Jul 12 2006, 11:44 AM

robchur wrote:

A special page which loaded a list of all images, then checked for file
existence on each, would be too expensive.

A special page which checks for inline inclusion of images which don't appear to
exist won't work with shared image repositories.

daniel added a comment. Via Conduit, Aug 9 2006, 7:32 PM

It works fine with shared repositories if there's access to the image table of
the repository - which is needed anyway in order to use it, right? SQL mockup:

SELECT page_namespace, page_title, il_to as img_name
FROM imagelinks
JOIN page ON page_id = il_from
WHERE NOT EXISTS( SELECT * FROM image WHERE img_name = il_to )
AND NOT EXISTS( SELECT * FROM commonswiki.image WHERE img_name = il_to )

Using LEFT JOIN instead of NOT EXISTS would be faster for a full list, but
slower if a limit in the hundreds is used.
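
For comparison, the LEFT JOIN variant could look roughly like the sketch below. It is only an illustration run from Python against a toy SQLite schema: the rows are invented, and the shared repository is modelled by attaching a second in-memory database under the name commonswiki, which is an assumption about the setup rather than a description of how MediaWiki reaches Commons. It says nothing about the relative speed of the two forms on a real database.

```python
import sqlite3

# Toy stand-in for a local wiki database; all rows are invented for illustration.
conn = sqlite3.connect(":memory:")
# Assumption: the shared repository is reachable as a second schema named commonswiki.
conn.execute("ATTACH DATABASE ':memory:' AS commonswiki")

conn.executescript("""
CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_namespace INTEGER, page_title TEXT);
CREATE TABLE imagelinks (il_from INTEGER, il_to TEXT);
CREATE TABLE image (img_name TEXT PRIMARY KEY);
CREATE TABLE commonswiki.image (img_name TEXT PRIMARY KEY);

INSERT INTO page VALUES (1, 0, 'Main_Page'), (2, 0, 'Sandbox');
INSERT INTO imagelinks VALUES (1, 'Local.png'), (1, 'Shared.png'), (2, 'Missing.png');
INSERT INTO image VALUES ('Local.png');
INSERT INTO commonswiki.image VALUES ('Shared.png');
""")

# The NOT EXISTS form from the mockup above.
NOT_EXISTS_QUERY = """
SELECT page_namespace, page_title, il_to
FROM imagelinks
JOIN page ON page_id = il_from
WHERE NOT EXISTS (SELECT 1 FROM image WHERE img_name = il_to)
  AND NOT EXISTS (SELECT 1 FROM commonswiki.image WHERE img_name = il_to)
ORDER BY il_to
"""

# The same filter written as two LEFT JOINs that keep only unmatched links.
LEFT_JOIN_QUERY = """
SELECT page_namespace, page_title, il_to
FROM imagelinks
JOIN page ON page_id = il_from
LEFT JOIN image AS local_img ON local_img.img_name = il_to
LEFT JOIN commonswiki.image AS shared_img ON shared_img.img_name = il_to
WHERE local_img.img_name IS NULL
  AND shared_img.img_name IS NULL
ORDER BY il_to
"""

missing_not_exists = conn.execute(NOT_EXISTS_QUERY).fetchall()
missing_left_join = conn.execute(LEFT_JOIN_QUERY).fetchall()
print(missing_left_join)
```

On this toy data both forms return only the link to Missing.png, since Local.png exists locally and Shared.png exists in the attached repository.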

bzimport added a comment. Via Conduit, Jan 18 2007, 10:13 AM

robchur wrote:

*** Bug 8683 has been marked as a duplicate of this bug. ***

bzimport added a comment. Via Conduit, May 15 2007, 8:00 PM

robchur wrote:

*** Bug 9924 has been marked as a duplicate of this bug. ***

bzimport added a comment. Via Conduit, Mar 10 2008, 7:01 PM

Eugene.Zelenko wrote:

*** Bug 13314 has been marked as a duplicate of this bug. ***

bzimport added a comment. Via Conduit, Apr 15 2008, 8:53 PM

Eugene.Zelenko wrote:

*** This bug has been marked as a duplicate of bug 13702 ***

siebrand added a comment. Via Conduit, Apr 15 2008, 9:06 PM

Not a dupe. The patch in bug 13702 also does not take shared repositories into account.

demon added a comment. Via Conduit, Mar 3 2009, 10:29 PM

Broken implementation or not, this is still a dupe to 13702 (or it's a dupe to here, but that bug was marked FIXED :)

*** This bug has been marked as a duplicate of bug 13702 ***

gpaumier added a comment. Via Conduit, Dec 28 2009, 7:38 PM

Reopening the bug and making it explicit that it requests support for shared repos.

Peachey88 added a comment. Via Conduit, Jun 20 2010, 1:17 AM

*** Bug 15688 has been marked as a duplicate of this bug. ***

Ilmari_Karonen added a comment. Via Conduit, Dec 4 2010, 4:14 PM

r77725 at least makes images on shared repos show up as struck-out bluelinks instead of redlinks in the output. It does nothing to fix the actual problem, but at least now you can visually tell the false positives apart from the actually missing files.

demon added a comment. Via Conduit, Feb 2 2011, 2:50 PM

*** Bug 27107 has been marked as a duplicate of this bug. ***

IAlex added a comment. Via Conduit, Apr 17 2011, 8:43 AM

*** Bug 28580 has been marked as a duplicate of this bug. ***

Nemo_bis added a comment. Via Conduit, Apr 19 2011, 8:58 PM

This is not an enhancement request; the page as it is just doesn't make any sense.
Example: http://meta.wikimedia.org/wiki/Special:WantedFiles

bzimport added a comment. Via Conduit, May 4 2011, 8:02 AM

bugzilla.wikimedia wrote:

This page as it is lends itself nicely towards amending it to a "List of files used from remote (shared) repositories" one - see bug 28807

G.Hagedorn added a comment. Via Conduit, Dec 31 2011, 1:46 PM

(In reply to comment #12)

r77725 at least makes images on shared repos show up as struck-out bluelinks
instead of redlinks in the output. It does nothing to fix the actual problem,
but at least now you can visually tell the false positives apart from the
actually missing files.

Given that this has been achieved, I wonder whether the bug could be closed by simply adding a filter option to hide the struck-out bluelinks. I have no insight into the code, but it seems the filter could be added with very little performance loss, provided we don't expect a precise number of results and the filter automatically switches to a large number of results per page (2000-5000) and adds an explanation like:

"2000/ACTUAL NUMBER files have been found that are not present in the local wiki. Of these, some or many are available in a shared file repository. These are not shown below. As a result, the number of missing files shown is variable."

This may not be ideal, but it is clearly better than the present consistent but rather useless behavior. Who is likely to browse through hundreds of pages of struck-out bluelinks to find the truly missing redlinks? In fact, on metawiki nobody seems to be doing this, so many broken links exist...
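
The filtering idea can be sketched as follows. This is a hypothetical illustration in Python, not MediaWiki code: the function name, the sample batch, and the set standing in for a shared repository lookup are all assumptions made for the example.

```python
def filter_locally_missing(batch, exists_in_shared_repo):
    """Hide entries a shared repository can supply; keep the truly missing ones."""
    truly_missing = [name for name in batch if not exists_in_shared_repo(name)]
    hidden = len(batch) - len(truly_missing)
    notice = (
        f"{len(batch)} files have been found that are not present in the local wiki. "
        f"{hidden} of these are available in a shared file repository "
        "and are not shown below."
    )
    return truly_missing, notice


# Hypothetical batch of locally missing files (the proposal suggests 2000-5000 per page).
batch = ["Missing_a.png", "Shared_b.png", "Missing_c.svg", "Shared_d.jpg"]
# Stand-in for whatever lookup would tell us a file exists in the shared repository.
shared_repo_files = {"Shared_b.png", "Shared_d.jpg"}

shown, notice = filter_locally_missing(batch, shared_repo_files.__contains__)
print(notice)
print(shown)
```

Because the filter runs after the batch is fetched, the number of rows shown per page varies, which is the trade-off the proposed explanation text is meant to communicate.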

Mark: you changed priority from Highest to Low without arguing - I think it would be better interaction with the community if you could argue or comment why. In some of your changes that may be evident from previous discussion, here I think not. You may well have much more information than Jan Kucera. Please share it.

Nemo_bis added a comment. Via Conduit, Dec 31 2011, 2:03 PM

(In reply to comment #18)

Mark: you changed priority from Highest to Low without arguing - I think it
would be better interaction with the community if you could argue or comment
why. In some of your changes that may be evident from previous discussion, here
I think not. You may well have much more information than Jan Kucera. Please
share it.

It's not a matter of interaction with the community, you probably missed bug 23816.
As a member of the community who voted for this bug, I'd rather mark it lowest priority or LATER, and disable the special page entirely on WMF wikis (see bug 31491).

G.Hagedorn added a comment. Via Conduit, Dec 31 2011, 2:46 PM

a) I certainly miss bug 23816 if nobody is referring to it. Thank you for doing so!

b) There are certainly multiple "communities" with different opinions here.

c) I don't see through this at all. Either the bug should be closed, and a new one opened, or ... The largest Wikipedias may have reached a number of broken file links that makes this functionality less likely to be essential, but smaller wikis can substantially improve their quality by fixing these errors. I believe many who voted for this bug see this as an important function, even if Nemo_bis does not. It is widely agreed that the present implementation is broken. The bluelink solution is a very good step, but it still puts off potential users (the first pages are usually all clean). I am opening a new Bug 33446 in an attempt to focus on my proposal for a possible solution that makes it more likely that editors are willing to research and fix broken file links.

I am sure I have overlooked many other things :-)

Nemo_bis added a comment. Via Conduit, Dec 31 2011, 3:23 PM

(In reply to comment #20)

b) There are certainly multiple "communities" with different opinions here.

Questionable.

c) I don't see through this at all. Either the bug should be closed, and a new
one opened, or ...

...we could close this and not open any.

The largest Wikipedias may have reached a number of broken
file links that make this functionality less likely to be essential, but
smaller Wikis can substantially improve their quality by fixing these errors.

I don't see any usefulness in this page on any of the (many) small projects I'm active in, now that there's the tracking category.

I
believe many who voted for this bug see this as an important function, even if
Nemo_bis does not.

Not really, those votes are very old and they all came before the tracking category (mine too).

It is widely agreed that the present implementation is
broken. The bluelink-solution is a very good step, but it is still offputting
potential users (the first pages are usually all clean). I am opening a new Bug
33446 in an attempt to focus on my proposal for a possible solution that makes
it more likely that editors are willing to research fix broken file links.

:/

G.Hagedorn added a comment. Via Conduit, Dec 31 2011, 4:15 PM

(ignoring what is best ignored:) I disagree that Special:WantedFiles is redundant.

However, the basic assumption that it is easier to work by page than by file is, in my opinion, erroneous. A missing file often occurs on dozens of pages. Look at Metawiki (there for multilinguality mostly). In other cases it is because repo files are renamed without keeping redirects. Or, out of old habit, deleted and re-uploaded under a different name.

In cases where a file is missing on dozens of pages, I consider an improved Special:WantedFiles desirable.

Bawolff added a comment. Via Conduit, Jan 1 2012, 10:13 AM

So I have some ideas how to fix this.

Basically, GlobalUsage stores what images that don't exist locally are in use. So I was thinking a query something like:

SELECT '6' AS namespace, gil_to AS title, COUNT(*) AS value
FROM globalimagelinks
LEFT JOIN image ON gil_to = img_name
WHERE img_name IS NULL
AND gil_wiki = 'jawikinews'
GROUP BY gil_to
ORDER BY COUNT(*) DESC;

(Using jawikinews as an example, since it's a smallish size wiki (5480 entries in global usage) thus I can easily test these queries on toolserver). 6 == NS_FILE.

This seemed to work, however with one problem. Image redirects were still included. I'm not sure if that's a GlobalUsage issue (should the links be to the target image?) or if it's intentional behaviour. Filtering those out in the SQL gives:

SELECT '6' AS namespace, gil_to AS title, COUNT(*) AS value
FROM globalimagelinks
LEFT JOIN image ON gil_to = img_name
LEFT JOIN page ON (gil_to = page_title AND page_namespace = 6)
WHERE img_name IS NULL
AND gil_wiki = 'jawikinews'
AND (page_is_redirect IS NULL OR page_is_redirect = 0)
GROUP BY gil_to
ORDER BY COUNT(*) DESC;

However, that seems to slow down the query by quite a bit (10 seconds went to 2 minutes). OTOH, the query is slow regardless, and it's going to be cached (I'm not sure how slow is too slow). This still would mess up on some edge cases though, such as if the page is a redirect to a non-existent file (or even to something not in NS_FILE). [And of course it doesn't address the more general problem of files from foreign repos. I'm not sure if the general problem is addressable without a schema change.]

So a possible way forward: add to the GlobalUsage extension a new special page that overrides the built-in special:wantedfiles with the new query. Even with the first query I mentioned, it would cut down on false positives significantly.
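
To make the difference between the two queries concrete, here is a small sketch that runs both from Python against a toy SQLite database. The schema is a simplified assumption (only the columns the queries touch), every row is invented, and the tie-breaking sort on gil_to was added only to make the output deterministic. It illustrates the logic, not the performance.

```python
import sqlite3

# Toy stand-in for the shared repository database, where GlobalUsage keeps its table.
# Simplified columns and invented rows; not the real schema in full.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE globalimagelinks (gil_wiki TEXT, gil_page INTEGER, gil_to TEXT);
CREATE TABLE image (img_name TEXT PRIMARY KEY);
CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_namespace INTEGER,
                   page_title TEXT, page_is_redirect INTEGER);

INSERT INTO image VALUES ('Exists.png');
-- Old_name.png is a file redirect on the repository: a page row but no image row.
INSERT INTO page VALUES (1, 6, 'Exists.png', 0), (2, 6, 'Old_name.png', 1);
INSERT INTO globalimagelinks VALUES
    ('jawikinews', 10, 'Gone.png'),
    ('jawikinews', 11, 'Gone.png'),
    ('jawikinews', 10, 'Exists.png'),
    ('jawikinews', 12, 'Old_name.png'),
    ('jawikinews', 13, 'Rare.png'),
    ('enwiki', 20, 'Gone.png');
""")

# First query: files used on one wiki that have no image row on the repository.
FIRST_QUERY = """
SELECT '6' AS namespace, gil_to AS title, COUNT(*) AS value
FROM globalimagelinks
LEFT JOIN image ON gil_to = img_name
WHERE img_name IS NULL
  AND gil_wiki = ?
GROUP BY gil_to
ORDER BY COUNT(*) DESC, gil_to
"""

# Second query: additionally drop titles that are file redirects on the repository.
REDIRECT_FILTERED_QUERY = """
SELECT '6' AS namespace, gil_to AS title, COUNT(*) AS value
FROM globalimagelinks
LEFT JOIN image ON gil_to = img_name
LEFT JOIN page ON (gil_to = page_title AND page_namespace = 6)
WHERE img_name IS NULL
  AND gil_wiki = ?
  AND (page_is_redirect IS NULL OR page_is_redirect = 0)
GROUP BY gil_to
ORDER BY COUNT(*) DESC, gil_to
"""

with_redirects = conn.execute(FIRST_QUERY, ("jawikinews",)).fetchall()
without_redirects = conn.execute(REDIRECT_FILTERED_QUERY, ("jawikinews",)).fetchall()
print(with_redirects)
print(without_redirects)
```

On this toy data the first query still lists the redirect Old_name.png as wanted, while the second one drops it and keeps only Gone.png and Rare.png.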

Krinkle added a comment. Via Conduit, Jan 1 2012, 11:15 PM

So it determines that a remote file exists by checking if it is used anywhere according to global usage. That's a smart idea. Although maybe not semantically correct, it should be good in practice.

If there is a link to an image on a local wiki and the image doesn't exist on the local wiki, it's going into global usage.

One problem though: right now the system works in such a way that if a file exists neither locally nor in the repository, globalusage catches it, not the local wiki (meaning, it's added to GlobalUsage as a redlink, not to the local wiki as a redlink). This means four things.

Three good things, which would hold us back from changing this behaviour

  • This is used to fix things if a file in the repo was deleted and is restored, the usage in globalusage is still there and can be restored if needed
  • This is used by gadget authors to track global usage. They make a comment in the script with the [[File:]] syntax in it with an inexisting file name. Requesting global usage for it will yield locations of copies of the script. This one can be worked around by uploading a bogus image to the repo, were this behavior to change and only tracking usage of existing images.
  • It acts a little bit like a global WantedFiles, files that are wanted by multiple wikis.

One bad thing that can compromise Bawolff's proposal:

  • Being in globalfileusage does not mean the file exists there...
Krinkle added a comment. Via Conduit, Jan 1 2012, 11:16 PM

..

  • Being in globalfileusage does not mean the file exists there..., just like an entry in the local *links table doesn't mean the target exists.

On the other hand, if a connection to globalfileusage is possible, perhaps a connection to the actual repository wiki database is possible as well ? One could (ahum) "simply" check the commonswiki database.

Bawolff added a comment. Via Conduit, Jan 1 2012, 11:33 PM

[mid air collision]

This is used to fix things if a file in the repo was deleted and is restored,
the usage in globalusage is still there and can be restored if needed

I'm not sure I understand. Do you mean if a file at Commons is deleted and then
restored? The outer join on image should take care of that (I'm assuming that global usage is in the same db as Commons). If you mean the
local file was deleted/restored, I assumed that would re-add/delete the entries
in global usage. Is that incorrect?

  • This is used by gadget authors to track global usage. They make a comment in
the script with the [[File:]] syntax in it with an inexisting file name.
Requesting global usage for it will yield locations of copies of the script.
This one can be worked around by uploading a bogus image to the repo, were this
behavior to change and only tracking usage of existing images.

Hmm, that is an interesting hack. At the end of the day, those would still
appear in special:wantedfiles if it was working properly. I don't really think
we should worry too much about that; moving special:wantedfiles in a somewhat
working direction, even with such links, is an improvement over the current
situation.

It acts a little bit like a global WantedFiles, files that are wanted by
multiple wikis.

Well, in my example query I filter by gil_wiki to do only one wiki. But we could
also make a special:globallywantedfiles which gives the most wanted files across
all the wikis.
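
A cross-wiki variant might look something like the sketch below, which simply drops the gil_wiki filter. It runs from Python against a toy SQLite database with a simplified, assumed schema and invented rows; the extra column counting distinct wikis is an addition for illustration and not part of the query discussed above.

```python
import sqlite3

# Toy stand-in for the shared repository database; simplified schema, invented rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE globalimagelinks (gil_wiki TEXT, gil_page INTEGER, gil_to TEXT);
CREATE TABLE image (img_name TEXT PRIMARY KEY);

INSERT INTO image VALUES ('Exists.png');
INSERT INTO globalimagelinks VALUES
    ('jawikinews', 10, 'Gone.png'),
    ('jawikinews', 11, 'Gone.png'),
    ('enwiki', 20, 'Gone.png'),
    ('enwiki', 21, 'Rare.png'),
    ('dewiki', 30, 'Exists.png');
""")

# No gil_wiki condition: count uses of each missing file across every wiki.
GLOBALLY_WANTED_QUERY = """
SELECT gil_to AS title,
       COUNT(*) AS uses,
       COUNT(DISTINCT gil_wiki) AS wikis
FROM globalimagelinks
LEFT JOIN image ON gil_to = img_name
WHERE img_name IS NULL
GROUP BY gil_to
ORDER BY uses DESC, title
"""

globally_wanted = conn.execute(GLOBALLY_WANTED_QUERY).fetchall()
print(globally_wanted)
```

Here Gone.png comes out on top with three uses spread over two wikis, and Exists.png is excluded because it has an image row.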

One bad thing that can compromise Bawolff's proposal:

  • Being in globalfileusage does not mean the file exists there...

I'm not sure I know what you mean. My proposal relies on the fact that there
are entries in globalusage for files that don't exist on the commons repo.

Krinkle added a comment. Via Conduit, Jan 2 2012, 12:00 AM

Can you join IRC for sec ?

gerritbot added a comment. Via Conduit, Jul 3 2014, 7:42 AM

Change 143835 had a related patch set uploaded by Brian Wolff:
Make Special:Wantedfiles not include foreign false positives.

https://gerrit.wikimedia.org/r/143835

matmarex added a comment. Via Conduit, Aug 11 2014, 10:33 AM

*** Bug 69391 has been marked as a duplicate of this bug. ***

Nemo_bis awarded a token. Via Web, Dec 12 2014, 8:04 AM
demon removed a subscriber: demon. Via Web, Dec 16 2014, 7:57 PM
Liuxinyu970226 added a subscriber: Liuxinyu970226. Via Web, Mar 10 2015, 2:07 PM
