Statistics page falsely reports no monuments for some datasets
Open, Needs Triage, Public

Description

As noticed in T286354, despite the API showing that there are monuments in the "in-com_en" dataset, the statistics page reported "Datasource (in-com, commons) is configured, but no monuments are in the database".

The cause is likely at least partially that the dataset became named "in-com_en" while the statistics look for "in-com_commons", i.e. one case uses the language of the site/dataset and the other uses the project. While this is fine for lists on Wikipedia, it fails on Commons and Wikidata.

"pt-wd_pt" suffers a similar fate despite the name of the dataset matching.
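The naming mismatch described above can be sketched as follows. This is only an illustration: `dataset_name` and `statistics_name` are hypothetical helpers, not functions from the heritage tool, and the assumption is simply that one name is built from the language in the config file name while the other is built from the "lang" value inside the config.

```python
# Hypothetical helpers for illustration; the real tool may build names differently.

def dataset_name(country: str, file_lang: str) -> str:
    """Name derived from the config file name, e.g. 'in-com_en'."""
    return f"{country}_{file_lang}"


def statistics_name(country: str, config_lang: str) -> str:
    """Name the statistics page looks for, derived from the config's lang value."""
    return f"{country}_{config_lang}"


# For the Commons-based India dataset the two sources disagree.
actual = dataset_name("in-com", "en")
expected = statistics_name("in-com", "commons")
names_match = actual == expected

print(actual, expected, names_match)
```

For a Wikipedia list, where the file-name language and the config language are the same value, the two names coincide, which is why the problem only shows up on Commons and Wikidata.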

Event Timeline

Looks like there are two distinct issues here:

  1. The error message links to "in-com_commons" when the dataset is called "in-com_en". (similar for my_en, also on Commons).

The reason for this is that the config says "lang=commons", so the naming of the config files has in fact broken the pattern expected here and elsewhere. The easiest solution is probably to rename the files to "in_commons" and "my_commons". [I have a vague memory of setting lang=en causing havoc even if project=commons is set later.]

  2. The statistics query fails.

The statistics query looks for hits in the database with country="in-com", lang="commons". But we have mapped the sql as country="in", lang="en". If we look at this diff we see that the number of monuments in the in_en dataset was reported to have increased drastically; this is where the in-com monuments are being reported. [pt-wd_pt is similarly being reported as part of pt_pt.]

Comparing to e.g. se-fornmin_sv, we see that this actually maps to country="se-fornmin", so by that pattern we should expect the current dataset to be mapped to country="in-com". The only problem with this is that searching the API for hits with country=in will not return any of these as hits, instead requiring country=in-com. This is already the case for e.g. Sweden, where country=se returns no hits despite there being four datasets hidden under more convoluted names. This is however more a problem of us calling the field "country" rather than the more correct "dataset". If you truly want all of the hits, they can always be found with "sradm0=in".
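A minimal sketch of why the statistics query comes back empty, using an in-memory SQLite table. The table name `monuments_all`, the `adm0` column and the sample rows are assumptions made for illustration; only the country/lang values come from the discussion above.

```python
import sqlite3

# Assumed schema and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE monuments_all (country TEXT, lang TEXT, adm0 TEXT, name TEXT)"
)
conn.executemany(
    "INSERT INTO monuments_all VALUES (?, ?, ?, ?)",
    [
        ("in", "en", "in", "monument from the in_en dataset"),
        # in-com monument stored under the wrong country/lang pair:
        ("in", "en", "in", "monument from the in-com_en dataset"),
        ("se-fornmin", "sv", "se", "monument from the se-fornmin_sv dataset"),
    ],
)


def count(where: str, params: tuple) -> int:
    """Count rows matching a parameterised WHERE clause."""
    sql = f"SELECT COUNT(*) FROM monuments_all WHERE {where}"
    return conn.execute(sql, params).fetchone()[0]


# What the statistics page asks for: nothing is stored under this pair.
stats_hits = count("country = ? AND lang = ?", ("in-com", "commons"))
# Where the in-com rows actually ended up, inflating the in_en numbers.
in_en_hits = count("country = ? AND lang = ?", ("in", "en"))
# "country" behaves like a dataset key, so country=se finds nothing...
se_hits = count("country = ?", ("se",))
# ...while filtering on the country-level admin field finds everything.
se_adm0_hits = count("adm0 = ?", ("se",))

print(stats_hits, in_en_hits, se_hits, se_adm0_hits)
```

Under these assumptions, remapping the in-com rows to country="in-com" would make them match the dataset key, but the statistics query would still miss them until the lang value also agrees.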

This still leaves the issue of the mismatched language field.

Lokal_Profil renamed this task from "Statistics page falsely reports no monuments" to "Statistics page falsely reports no monuments for some datasets". Aug 28 2021, 10:06 PM

Change 715320 had a related patch set uploaded (by Lokal Profil; author: Lokal Profil):

[labs/tools/heritage@master] Correct sql maping of country/dataset for various configs

https://gerrit.wikimedia.org/r/715320

This should fix a bunch of the datasets, but notably not in-com, as the language is still mismatched. I looked for issues related to not using "language=commons" for my_en but couldn't find anything on a quick scan. T173783 is the place to look for it.

Feel free to unsubscribe if the statistics bit wasn't a worry for you. See note in T289929#7316722 above about "in-com" statistics being included on the "in" line.

This is however more a problem of us calling the field "country" rather than the more correct "dataset".

Indeed :-(

Change 715320 merged by jenkins-bot:

[labs/tools/heritage@master] Correct sql maping of country/dataset for various configs

https://gerrit.wikimedia.org/r/715320