Duplicate monument numbers overwrite each other, ending up with only the last one.
Closed, Resolved (Public)

Description

When a duplicate monument number is used within a country, possibly due to copy/paste, each new entry overwrites the previous one, which can give very unexpected results.

We should detect this. During the harvester run it will require keeping track of the IDs seen, likely together with the page each was seen on, and reporting any duplicate with both pages listed.

This should then be made public, so volunteers can fix those errors.

In https://github.com/wikimedia/labs-tools-heritage/blob/master/erfgoedbot/update_database.py at line 558, there is already a TODO comment about including this.

Event Timeline

Looking more at the code. There are two separate cases.

The first is the countries based on Wikidata. Although it might be good to report those as well, duplicates there can also be reported on Wikidata itself.

For the countries using Wikipedia lists, the function process_country_list (line 481) is called.

That code loops over the pages with the template and, on line 514, calls process_page. That function currently returns unknown_fields, which is also passed in as an argument.

process_page is on line 428 and calls process_monument on line 450, which is defined at line 359.

We need to keep state across all of these calls. So instead of returning unknown_fields, pass a dictionary that contains the unknown fields, a dictionary of already processed pages with the IDs found on them, and a hash of IDs that have multiple entries, with the pages they were found on.

Then, if process_monument finds a duplicate ID, it can record it with both pages if the ID was not yet known to be a duplicate, or simply add the current page if it was.

After all the pages are processed, the reporting can be done based on that.
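
The state passing described above could be sketched roughly as follows. This is a minimal illustration, not the code in update_database.py: the function names new_harvest_state, record_monument and duplicate_report, and the keys of the state dictionary, are assumptions made up for this example.

```python
def new_harvest_state():
    """Create the state shared by all processing calls (hypothetical layout)."""
    return {
        'unknown_fields': {},  # field name -> count, as tracked today
        'seen_ids': {},        # monument id -> page where it was first seen
        'duplicate_ids': {},   # monument id -> all pages using that id
    }


def record_monument(state, monument_id, page):
    """Register an id found on a page; return True if it is a duplicate."""
    seen = state['seen_ids']
    duplicates = state['duplicate_ids']
    if monument_id not in seen:
        seen[monument_id] = page
        return False
    if monument_id not in duplicates:
        # First clash: start the entry with the page of the original.
        duplicates[monument_id] = [seen[monument_id]]
    # Every clash, first or later, adds the current page.
    duplicates[monument_id].append(page)
    return True


def duplicate_report(state):
    """Build one report line per duplicated id, listing its pages."""
    lines = []
    for monument_id, pages in sorted(state['duplicate_ids'].items()):
        lines.append('{}: {}'.format(monument_id, ', '.join(pages)))
    return lines
```

In this sketch an ID repeated on the same page lists that page twice, which may or may not be what the public report should show.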

The mechanism with the dictionary can then be used for similar passing down of data as well.

The check that the primary keys are present is at line 403, which is likely the right place for this check as well.

Change #1251595 had a related patch set uploaded (by Lokal Profil; author: Lokal Profil):

[labs/tools/heritage@master] Detect and report duplicate monument IDs during harvest

https://gerrit.wikimedia.org/r/1251595

Thanks. I looked at the diff. I noticed that with multiple primary keys, you separate them with a -, as is also done in the database. This is OK internally, but as this will be presented publicly we should use the form known in the community. This will likely require an extra (optional) config field defining the separator.
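
One way the optional separator could work, as a sketch only: the config key id_display_separator and the helper format_monument_id are hypothetical and do not exist in the erfgoedbot config or code.

```python
DB_SEPARATOR = '-'  # separator used for the combined key in the database


def format_monument_id(key_values, country_config, public=False):
    """Join the primary key values into one id string.

    For public output, use the country's optional
    'id_display_separator' (hypothetical config key) and fall back
    to the database separator when it is not configured.
    """
    if public:
        separator = country_config.get('id_display_separator', DB_SEPARATOR)
    else:
        separator = DB_SEPARATOR
    return separator.join(str(value) for value in key_values)
```

Countries without the extra field keep the current behaviour, so the field stays optional.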

Change #1251595 merged by jenkins-bot:

[labs/tools/heritage@master] Detect and report duplicate monument IDs during harvest

https://gerrit.wikimedia.org/r/1251595

Change #1254973 had a related patch set uploaded (by Akoopal; author: Akoopal):

[labs/tools/heritage@master] Added hamburg to erfgoedbot config.

https://gerrit.wikimedia.org/r/1254973

Removed the patch for review, as it was for another bug.

Mentioned in SAL (#wikimedia-cloud) [2026-03-19T22:23:47Z] <wmbot~lokal-profil@tools-bastion-15> Deploy d1e2abc (T420076), a196cd6 (T420019), 3f3a858 (T55271)