
Harvesting of several countries does not work because of Namespace filtering issue
Closed, ResolvedPublic

Description

The harvesting of several countries does not work, because the filtering Namespace is unknown:

Working on countrycode "pl-old" in language "pl"
ERROR: u'Namespace identifier(s) not recognised: 102'

Errors:

ERROR: u'Namespace identifier(s) not recognised: 102'
Unknown error occurred when processing country pl-old in lang pl
--
ERROR: u'Namespace identifier(s) not recognised: 104'
Unknown error occurred when processing country es in lang es
--
ERROR: u'Namespace identifier(s) not recognised: 102'
Unknown error occurred when processing country it in lang it
--
ERROR: u'Namespace identifier(s) not recognised: 104'
Unknown error occurred when processing country bo in lang es
--
ERROR: u'Namespace identifier(s) not recognised: 104'
Unknown error occurred when processing country uy in lang es
--
ERROR: u'Namespace identifier(s) not recognised: 104'
Unknown error occurred when processing country mx in lang es
--
ERROR: u'Namespace identifier(s) not recognised: 104'
Unknown error occurred when processing country cl in lang es
--
ERROR: u'Namespace identifier(s) not recognised: 104'
Unknown error occurred when processing country ru-old in lang ru
--
ERROR: u'Namespace identifier(s) not recognised: 104'
Unknown error occurred when processing country sv in lang es
--
ERROR: u'Namespace identifier(s) not recognised: 104'
Unknown error occurred when processing country ru in lang ru
--
ERROR: u'Namespace identifier(s) not recognised: 104'
Unknown error occurred when processing country co in lang es
--
ERROR: u'Namespace identifier(s) not recognised: 104'
Unknown error occurred when processing country pa in lang es
--
ERROR: u'Namespace identifier(s) not recognised: 104'
Unknown error occurred when processing country ve in lang es
--
ERROR: u'Namespace identifier(s) not recognised: 104'
Unknown error occurred when processing country ar in lang es

Event Timeline

JeanFred claimed this task.
JeanFred raised the priority of this task to Needs Triage.
JeanFred updated the task description.
JeanFred subscribed.

That’s weird: the template Zabytki wiersz is mainly transcluded in namespace Wikiprojekt, whose number is 102…

Also, this is the pl-old config; there is also a pl config.

@Multichill, @Effeietsanders, would you know more about this − or would you know who might know more about this?

JeanFred renamed this task from Harvesting of {country: pl-old, lang: pl} does not work because of Namespace filtering issue to Harvesting of several countries does not work because of Namespace filtering issue. Aug 27 2015, 9:47 AM
JeanFred updated the task description.
JeanFred set Security to None.

Problem is bigger than I thought >_>

I dunno how all the -old tables got in there. You would have to check the git history for that. Best steps: 1. Check whether the template still exists and whether it still has transclusions. 2. If it's not used, just remove the source; if it's used, switch the namespace.

Hi! The database of Russian monuments needs to be modified. The country "ru-old" is no longer needed. These lists are obsolete, and I am not sure that they even exist. The rules for the country "ru" are outdated as well.

Up-to-date Russian lists are located at https://ru.wikivoyage.org/wiki/Культурное наследие России/ (most of the lists are subpages). The lists are organized using the {{monument}} template. If necessary, I can provide the mapping of template's parameters onto the fields that the bot reads.

Fine print: this year we also have the lists of Crimean monuments that are organized as subpages of https://ru.wikivoyage.org/wiki/Культурное наследие/
The country name is omitted in the pagetitle in order to avoid the delicate issue of which state Crimea belongs to. These lists could be uploaded into the database as well, although they are mostly repeating Ukrainian lists, so there will be lots of overlaps. Skipping the Crimean lists for the time being can also be a solution.

So, the namespace error is fixed? What is the new error now? Which countries are affected?

The harvesting of Russian monuments is in a separate task now:
https://phabricator.wikimedia.org/T110665

...but I do not see much activity there=(

Where is the code for erfgoed bot? I want to check it.

MariaDB [s51138__heritage_p]> select count(*) from monuments_all where lang='es' and country='es';
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.00 sec)

Yes. Here is what happens: when running the update for all countries, the ones listed in T110420 crash right away. This causes the table for that country to be reinitialised to 0 rows. This bug does *not* happen when running the bot on one of these countries only (which I did a couple of times, see P1940). This puzzles me greatly.

@Emijrp, do note that when/even if I manage to get these countries working as part of the full run, the issues listed in P1940 will persist (I’m thinking of the 277 missing primkeys)

Try changing line 482 in update_database.py from:

filteredGen = pagegenerators.NamespaceFilterPageGenerator(
        transGen, countryconfig.get('namespaces'))

to:

filteredGen = pagegenerators.NamespaceFilterPageGenerator(
        transGen, countryconfig.get('namespaces'), site=site)
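
For illustration, here is a toy model of why the missing site argument could produce this error. It is only a sketch under assumptions, not pywikibot's actual code: namespace numbers are looked up against one site's table, so an identifier such as 102 is valid on pl but unknown on a site that does not define it. The names `SITE_NAMESPACES`, `resolve_namespaces` and `try_resolve` are hypothetical, and so is the "default" site and its contents; only 102 = Wikiprojekt on pl comes from this task.

```python
# Toy model of per-site namespace lookup. This is NOT pywikibot's code.
# Only "102 = Wikiprojekt on pl" comes from this task; the "default"
# site and its namespace table are hypothetical.
SITE_NAMESPACES = {
    "default": {0: ""},
    "pl": {0: "", 102: "Wikiprojekt"},
}


def resolve_namespaces(identifiers, site="default"):
    """Return the namespace names for `identifiers` on `site`.

    Raises KeyError naming the identifiers that the site does not
    define, similar in spirit to the error reported above.
    """
    known = SITE_NAMESPACES[site]
    unknown = [ns for ns in identifiers if ns not in known]
    if unknown:
        raise KeyError(
            "Namespace identifier(s) not recognised: "
            + ", ".join(str(ns) for ns in unknown)
        )
    return [known[ns] for ns in identifiers]


def try_resolve(identifiers, site="default"):
    """Demo helper: return the resolved names, or the error message."""
    try:
        return resolve_namespaces(identifiers, site)
    except KeyError as err:
        return "ERROR: " + err.args[0]


# Without an explicit site, the lookup falls back to the default table.
print(try_resolve([102]))
# With the country's own site, the same identifier resolves.
print(try_resolve([102], site="pl"))
```

In this model the first call fails and the second succeeds, which matches the shape of the suggested fix: pass the site of the country being harvested instead of relying on a default.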

Change 235260 had a related patch set uploaded (by Jean-Frédéric):
Specify site to use when specifying NamespaceFilterPageGenerator

https://gerrit.wikimedia.org/r/235260

Change 235260 merged by Jean-Frédéric:
Specify site to use when specifying NamespaceFilterPageGenerator

https://gerrit.wikimedia.org/r/235260

Try changing line 482 in update_database.py from:
[snip]

Thanks! Merged. I’m rescheduling an update in a short bit.

@Emijrp, do note that when/even if I manage to get these countries working as part of the full run, the issues listed in P1940 will persist (I’m thinking of the 277 missing primkeys)

@JeanFred The missing primkeys error is due to empty cells in the Spanish lists, not a bot error. So, it will be solved when wikipedians improve the tables.

I know :) I suspected that these missing primkeys might be partially responsible for the missing (es) monuments − but that was silly of me: 277 is not that much compared to the number of monuments in Spain (which I do not know, but imagine to be bigger than that ;)

Job in progress. Will take a while. I will monitor it a bit to see if this issue is now fixed.

Harvesting is going well. Closing this as Resolved. Thanks Emilio for finding the bug!

May I remind you that the settings for the Russian monuments should be modified? Even if the harvesting works with the old settings, it makes little sense for us, because the majority of coordinates are wrong. Some objects are shifted from Central Russia to the Pacific coast, and so on. We should use current lists for harvesting, not some old outdated stuff.

If changing the settings is not possible at the moment, I would rather remove Russian monuments from the database completely because we should not send our participants to wrong places asking them to take photos of buildings that are not cultural heritage or do not exist at all.

@JeanFred Can you change the Russian config from Wikipedia to Wikivoyage? https://phabricator.wikimedia.org/T110665

@JeanFred The monuments_all table doesn't exist right now!