Solve ErfgoedBot categorisation problem
Closed, Resolved · Public

Description

ErfgoedBot’s categorisation task is currently stalled. There is a huge backlog of images for which ErfgoedBot can do nothing (because they have no ID, or no better category can be inferred from the database). This means that it spends hours going through the same images, before getting to the new ones that need it.

It also blocks harvesting, since both are done in one job and only one job can run at a time. As a result, harvesting is no longer daily: the cycle currently takes more than 2 days.

I would be tempted to blacklist the countries with a backlog of thousands of files that have been sitting there for a year (see https://commons.wikimedia.org/wiki/Commons:Monuments_database/Categorization/Statistics).

Thoughts?

Event Timeline

I think people are no longer aware of how these processes work, so step one is making them aware. For example, I would not be surprised if the Iran community, were it better aware of what to do, actually did it. A clear instruction email with pointers could go a long way. This may require some help in designing category trees on Commons.

I also think it would make sense to split the countries into two tiers: one where active maintenance takes place, which runs separately from the countries that are not actively maintained. Could we run the second tier weekly or monthly, perhaps? (This is for the overview pages; harvesting would ideally still happen frequently.)
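To make the two-tier idea concrete, here is a minimal sketch of how a daily run could decide which countries are due. It is not ErfgoedBot code: the function name `countries_due`, the tier sets, and the country codes are all hypothetical placeholders, and "weekly" is arbitrarily taken to mean "on Mondays".

```python
from datetime import date

# Hypothetical tier assignment; the country codes are placeholders only.
ACTIVE_TIER = {"aa", "bb"}
DORMANT_TIER = {"yy", "zz"}


def countries_due(today, active=ACTIVE_TIER, dormant=DORMANT_TIER):
    """Return the countries whose overview pages should be refreshed today.

    Actively maintained countries are due on every run; the others are
    only due on Mondays (an assumed stand-in for "weekly").
    """
    due = set(active)
    if today.weekday() == 0:  # Monday
        due |= set(dormant)
    return sorted(due)
```

With such a split, a backlog in the second tier would only cost run time once a week instead of on every run.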

Finally, if dudemanfellabro's tool could be improved a bit and maintained again, it could go a long way towards shrinking the other lists. Right now part of the problem is that you can't 'blacklist' images, so they keep showing up. The category tree on Commons is also polluted, because someone has been adding multiple identifiers to certain categories where the identifier is really only relevant to part of the category.

This won't solve the whole problem at once, but maybe a little bit?

I would support blacklisting, but followed up with an e-mail to the local organisers of the blacklisted countries. A positive response and action as a result of those e-mails would take the country off the blacklist. That way those countries are not blocking the others.
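As a rough illustration of what blacklisting could mean in practice, the sketch below partitions the datasets into those to categorise and those to skip. The names `SKIPPED` and `split_by_skip_list`, the (country, language) keying, and the values are assumptions made for the example, not the tool's actual configuration.

```python
# Hypothetical skip list of (country, language) pairs; placeholder values.
SKIPPED = {("xx", "en"), ("yy", "fr")}


def split_by_skip_list(datasets, skipped=SKIPPED):
    """Partition datasets into (to_process, to_skip), preserving order."""
    to_process, to_skip = [], []
    for dataset in datasets:
        if dataset in skipped:
            to_skip.append(dataset)
        else:
            to_process.append(dataset)
    return to_process, to_skip
```

Returning the skipped datasets rather than silently dropping them keeps them available for reporting, and removing a country from the list is a one-line change once its organisers have acted.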

It might also be worth splitting the job up into a harvest and a categorise one.

I was initially not keen on doing that, as I am not sure how well the infrastructure would support parallel processing; but on second thoughts it should be fine (especially as categorisation does not write to the DB).

Change 377483 had a related patch set uploaded (by Jean-Frédéric; owner: Jean-Frédéric):
[labs/tools/heritage@master] Split categorization out of daily update job

https://gerrit.wikimedia.org/r/377483

Change 377483 merged by jenkins-bot:
[labs/tools/heritage@master] Split categorization out of daily update job

https://gerrit.wikimedia.org/r/377483

Change 378021 had a related patch set uploaded (by Jean-Frédéric; owner: Jean-Frédéric):
[labs/tools/heritage@master] Skip categorisation for some countries

https://gerrit.wikimedia.org/r/378021

Change 378021 merged by jenkins-bot:
[labs/tools/heritage@master] Skip categorisation for some countries

https://gerrit.wikimedia.org/r/378021

Before closing this task we should (possibly as a subtask) reach out to the local organisers of the skipped countries and explain what they need to do to get off the list and why it is important that they do (i.e. the way to find things on Commons is through categories).

Mentioned in SAL (#wikimedia-cloud) [2017-09-17T18:28:43Z] <JeanFred> Deploy latest from Git master: b3c3ab0 (T174871)

Change 378641 had a related patch set uploaded (by Lokal Profil; owner: Lokal Profil):
[labs/tools/heritage@master] [WIP] Ensure skipped image categorizations are mentioned in stats

https://gerrit.wikimedia.org/r/378641

Change 378641 merged by jenkins-bot:
[labs/tools/heritage@master] Ensure skipped image categorizations are mentioned in stats

https://gerrit.wikimedia.org/r/378641
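For readers unfamiliar with the idea behind this change, the following is a minimal sketch of how skipped countries could still appear in the statistics instead of silently disappearing. It is not the merged patch: `build_stats_rows`, the row fields, and the inputs are hypothetical, and the real statistics page format is not reproduced here.

```python
def build_stats_rows(results, skipped):
    """Build one statistics row per country, including skipped ones.

    results: dict mapping country code -> number of images categorised.
    skipped: iterable of country codes that were not processed.
    """
    skipped = set(skipped)
    rows = []
    for country in sorted(set(results) | skipped):
        if country in skipped:
            # No count is reported for a skipped country.
            rows.append({"country": country, "status": "skipped", "categorised": None})
        else:
            rows.append({"country": country, "status": "done", "categorised": results[country]})
    return rows
```

Listing the skipped countries explicitly gives local organisers a visible signal that their country is excluded.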

Mentioned in SAL (#wikimedia-cloud) [2017-09-21T19:35:40Z] <JeanFred> Deploy latest from Git master: 730d577 (T174871)

@JeanFred Can this be closed?

The actual work on this has been done. What is most likely missing is pinging the organisers whose countries are blacklisted at Commons:Monuments database/Categorization/Statistics and explaining what they need to do if they want to be included.

For good measure we should probably also look at the logs to make sure categorisation has not crashed since we last updated this.

Adding to 2018 since we are likely blacklisting more countries but still haven't pinged the organisers to explain what they need to do to get off the blacklist.

Breaking this out as a separate task at T203338