
Consider restricting the 'usage' to certain namespaces
Open, Needs Triage, Public

Description

One of the four core metrics of the Wikiloves tool suite, as documented in Tool:Wikiloves#Metrics, is “Images used in the wikis: How many of these uploads are in use in the wikis”.

This is achieved by checking whether the image is in the gil_to column in the globalimagelinks table (code).

This method obviously treats all usages as equal, including transclusions in the Wikipedia: or Project: namespaces. If a competition were, for example, to build galleries of all uploaded pictures (per monument/location/uploader/etc.) in the Wikipedia: namespace, this would lead to ~100% usage, which might be surprising.
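
To make the effect concrete, here is a minimal sketch on toy data; it is not the tool's actual query. It assumes the `gil_page_namespace_id` column of globalimagelinks is what a namespace filter would use. The in-memory SQLite table, the file names and the `count_used` helper are made up for illustration.

```python
import sqlite3

# Toy stand-in for globalimagelinks, with only the columns needed here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE globalimagelinks ("
    "gil_wiki TEXT, gil_page_namespace_id INTEGER, gil_to TEXT)"
)
conn.executemany(
    "INSERT INTO globalimagelinks VALUES (?, ?, ?)",
    [
        ("ukwiki", 4, "A.jpg"),  # gallery page in the Wikipedia:/Project: namespace
        ("ukwiki", 4, "B.jpg"),
        ("ukwiki", 0, "B.jpg"),  # B.jpg is also used in an article
        ("ukwiki", 4, "C.jpg"),
    ],
)


def count_used(conn, uploads, namespaces=None):
    """Count uploads with at least one usage, optionally limited to some namespaces."""
    file_marks = ",".join("?" * len(uploads))
    sql = (
        "SELECT COUNT(DISTINCT gil_to) FROM globalimagelinks "
        f"WHERE gil_to IN ({file_marks})"
    )
    params = list(uploads)
    if namespaces is not None:
        ns_marks = ",".join("?" * len(namespaces))
        sql += f" AND gil_page_namespace_id IN ({ns_marks})"
        params += list(namespaces)
    return conn.execute(sql, params).fetchone()[0]


uploads = ["A.jpg", "B.jpg", "C.jpg", "D.jpg"]  # D.jpg is not used anywhere
used_any = count_used(conn, uploads)                   # every namespace counts
used_articles = count_used(conn, uploads, namespaces=[0])  # article namespace only
```

With this toy data, three of the four uploads count as used when every namespace is considered, but only one does when the filter is limited to NS 0.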

Two questions:

  • Should the query discriminate on namespace when calculating the usage?
  • Which namespace should be used?

Some relevant previous work:

  • Glamorous will offer a tickbox “Namespaces − Show file usage in article namespace only”
  • Baglama will consider all namespaces

(Per its design, the wikiloves tool cannot make it a user-selected option.)

Event Timeline

JeanFred created this task. Aug 27 2020, 8:03 AM
Restricted Application added a subscriber: Aklapper. Aug 27 2020, 8:03 AM

This seems reasonable; indeed, WLE 2020 Ukraine achieving 100% usage is probably not what is expected :)

While I agree that the 100% example above is not what one might expect from the metric, I do want to point out that it _does_ follow the metric definition: whether the file is in use or not, which is a very clear-cut definition. So this is less an implementation issue than a definition issue.

We thus need a new definition, and personally I am uneasy about, essentially, defining which global use is worthy and which is not. It does sound dramatic, but it does boil down to that, and I do not think this sort of decision quite belongs in the hands of the implementer.

While it sounds simple (“only consider Article namespace, i.e. NS 0”), I also think it is more complicated than that. With rTHER, I have the configuration for 117 datasets on various wikis, including where the harvesting bot should find the monument lists − arguably, usage in the monument lists can be considered 'worthy'.
78 datasets are indeed defined for NS 0¹ only; yet others have a variety of namespaces. Here’s the breakdown:

 1 [0,100]
 1 [0,2]
 8 [0,4]
 1 [102]
11 [104]
 1 [118]
 7 [4]

Now, I have no idea what these namespaces are − turns out:

  • 104 is the “Page” Wikisource NS, and also the “Anexo” NS on es.wp.
  • 108 is the Page NS on it.ws.
  • 100 is the Portal NS on dewiki (probably 'unworthy'?) but also the “Appendix” NS on en.wikt (probably 'worthy'?)
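
Since the same namespace number means different things on different wikis, a filter would probably have to be keyed per wiki. The following is only a sketch of that idea: the allowlist contents mirror the tentative 'worthy'/'unworthy' guesses above, and the names `DEFAULT_WORTHY`, `WORTHY_BY_WIKI` and `is_worthy` are hypothetical.

```python
# Hypothetical per-wiki allowlist of namespace ids whose usage counts as 'worthy'.
DEFAULT_WORTHY = frozenset({0})  # assumption: articles only, unless stated otherwise

WORTHY_BY_WIKI = {
    "eswiki": frozenset({0, 104}),        # 104 = "Anexo" on es.wp
    "enwiktionary": frozenset({0, 100}),  # 100 = "Appendix" on en.wikt
    # dewiki has no entry: 100 is the Portal NS there, so only NS 0 counts
}


def is_worthy(wiki, namespace_id):
    """Return True if a usage on `wiki` in `namespace_id` should count."""
    return namespace_id in WORTHY_BY_WIKI.get(wiki, DEFAULT_WORTHY)
```

A table like this makes the value judgement explicit and reviewable, rather than leaving it buried in a query.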

Maybe I am making things more complicated than they need to be − most 'worthy usage' /is/ probably in NS0, and it’s clear that User: namespace should not be considered 'worthy'. But the question is more complicated than it seems, obvious answers turn out not to be so obvious, and I don’t necessarily want to be the one imposing my views. :-)

¹ My jq is rusty but this worked:

cat *.json | jq -c '.table, .namespaces'| grep -B 1 '\[0\]' | grep monuments | wc -l
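
For the full breakdown rather than just the NS 0 count, something along these lines should work. This is a sketch, assuming each JSON file holds a single object with `table` and `namespaces` keys (as the jq call above suggests); the function name `namespace_breakdown` is made up.

```python
import json
from collections import Counter
from pathlib import Path


def namespace_breakdown(config_dir):
    """Count monuments datasets per distinct list of namespaces."""
    counts = Counter()
    for path in sorted(Path(config_dir).glob("*.json")):
        config = json.loads(path.read_text(encoding="utf-8"))
        # Same filter as `grep monuments`: keep only monuments tables.
        if "monuments" not in str(config.get("table", "")):
            continue
        # Lists are not hashable, so use a tuple as the Counter key.
        counts[tuple(config.get("namespaces", []))] += 1
    return counts
```

Calling `namespace_breakdown(".")` in the configuration directory would return a Counter keyed by namespace tuples such as `(0,)` or `(0, 4)`.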

Personally, I would just exclude a few of the namespaces or fork them into a separate percentage (i.e. % used on Project and User pages): Project space (which is 4?) and User space (which is 2?). Neither of them is supposed to be used for "content" in the browsable sense. For the GLAM tools, it's less of a big deal that they track these secondary uses, because the high-level metric that most of them want to report is pageviews. In this case, though, the usage rate is far more important because the metric is an indicator of organizing success.
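
A rough sketch of what such a split could look like, assuming the default MediaWiki numbering (2 = User, 4 = Project) and that usages are available as (file name, namespace id) pairs. The function name `usage_split` and the result keys are hypothetical.

```python
PROJECT_AND_USER = {2, 4}  # 2 = User, 4 = Project in MediaWiki's default numbering


def usage_split(uploads, usages):
    """Split the usage rate into 'content' and 'Project/User only' percentages.

    uploads: iterable of file names
    usages:  iterable of (file_name, namespace_id) pairs
    """
    upload_set = set(uploads)
    content, secondary = set(), set()
    for file_name, namespace_id in usages:
        if file_name not in upload_set:
            continue
        if namespace_id in PROJECT_AND_USER:
            secondary.add(file_name)
        else:
            content.add(file_name)
    # Files used in content namespaces are reported once, under 'content'.
    secondary_only = secondary - content
    total = len(upload_set)
    if total == 0:
        return {"content_pct": 0.0, "project_user_only_pct": 0.0}
    return {
        "content_pct": 100 * len(content) / total,
        "project_user_only_pct": 100 * len(secondary_only) / total,
    }


example = usage_split(
    ["A.jpg", "B.jpg", "C.jpg", "D.jpg"],
    [("A.jpg", 4), ("B.jpg", 4), ("B.jpg", 0), ("C.jpg", 2)],
)
```

Here only B.jpg counts towards the content percentage, while A.jpg and C.jpg land in the separate Project/User figure.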