
Add blacklist to ErfgoedBot categorisation process
Closed, Resolved, Public

Description

The categorisation process, in particular far-reaching options like method D (from upper categories of monument list page), may return unwanted categories. A typical example is [[Category:Noindexed pages]].

Although the actual root issue lies with the original data (i.e. the monument lists), a mitigating option is to define a blacklist of categories that ErfgoedBot should never categorise to.
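
As a rough illustration of the mitigation described above, the filtering step could look something like the sketch below. The function names `normalise` and `filter_blacklisted` are hypothetical and are not taken from the ErfgoedBot code base. The normalisation rules (optional `Category:` prefix, underscores treated as spaces, first letter case-insensitive) are an assumption about how category titles should be compared.

```python
def normalise(category):
    """Return a comparable form of a category title (assumed rules)."""
    name = category.strip()
    # Drop an optional "Category:" namespace prefix.
    if name.lower().startswith("category:"):
        name = name[len("category:"):]
    # Treat underscores and spaces as equivalent.
    name = name.replace("_", " ").strip()
    # Compare titles with the first letter uppercased.
    return name[:1].upper() + name[1:]


def filter_blacklisted(categories, blacklist):
    """Return the categories that are not on the blacklist, order preserved."""
    blocked = {normalise(entry) for entry in blacklist}
    return [cat for cat in categories if normalise(cat) not in blocked]


# Hypothetical candidates, e.g. as gathered from a monument list page.
candidates = ["Category:Noindexed pages", "Rijksmonumenten in Amsterdam"]
print(filter_blacklisted(candidates, ["Noindexed pages"]))
# → ['Rijksmonumenten in Amsterdam']
```

Passing the blacklist in as a plain iterable keeps the filter independent of where the list is stored, whether a local file or a wiki page.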

Event Timeline

Change 328348 had a related patch set uploaded (by Jean-Frédéric):
Filter out blacklisted categories during categorisation

https://gerrit.wikimedia.org/r/328348

I like T153746 more. If you go forward with this, you might as well use https://commons.wikimedia.org/wiki/User:Multichill/Category_blacklist as a start. That's the one categorizationbot used to use.

I also think T153746 is the way to go (unless there are non-hidden categories also causing problems).

Then I would suggest we go for both. Hidden categories should never be added, and in addition we keep a curated blacklist of non-hidden categories.

And I would put the blacklist on wiki, for example at https://commons.wikimedia.org/wiki/User:ErfgoedBot/Category_blacklist . Just put full protection on it against people messing with it.

I updated the patch to work on top of the one introduced in T153746. It still relies on a yml store. If we want it to read from the wiki, then let's create a separate task for that.

Change 328348 merged by jenkins-bot:
[labs/tools/heritage@master] Filter out blacklisted categories during categorisation

https://gerrit.wikimedia.org/r/328348

Mentioned in SAL (#wikimedia-cloud) [2017-09-23T18:31:37Z] <JeanFred> Deploy latest from Git master: f8a1b8a (T153744)