
dplbot untagged uncats: polluted data on toollabs
Open, Needs Triage, Public


I want to report an ongoing problem with the Untagged Uncategorized Articles list: it frequently detects and lists pages which it should not. I've tried to report this several times at en's technical help desk, but it's never been resolved and it's recently been suggested to me that the problem may be at the Toollabs end instead -- data corruption issues in the Toollabs replication of the database, rather than on en itself -- so I'm reporting it here this time.

There are several different versions of this problem:

(1) Pages which are properly categorized, but the last edit in the history is a revert of page-blanking vandalism -- but only if that revert was performed by a bot instead of a human editor. Example: [[Black Power]]. These can usually be cleared with a null edit, but it would still be preferable if they did not show up at all.

(2) Pages which are properly categorized, but the last edit in the history is a page move to a new title. Examples: [[Brain vital signs]], [[Bingham Road railway station]], [[Andreas Nödl]]. This can sometimes be cleared with a null edit, but other times that fails and I have to go all the way to temporarily deleting and then restoring the article to actually clear it from the list.

(3) Random regurgitated clusters of pages that have already been deleted, sometimes months earlier; the common element being that at any given time, the pages of this type which appear on the list were always deleted right around the same time as each other. Examples: [[Axiom Landbase Pvt. Ltd.]], [[BatissForever]], [[Bhaiyato The Hobbit]]. These end up being entirely impossible to clear from the list -- restoring doesn't work, redeleting doesn't work, creating a placeholder page filed in the "Temporary maintenance holdings" category doesn't work -- and I end up having to just work around them as permanent speed bumps on the list until they somehow decide to clear on their own. Occasionally, I've even had to ask JaGa to hardcode such titles directly into the bot programming as specific exclusions -- but he hasn't been around much lately, so that can't be the permanent answer to this.

(4) Random clusters of recently created articles; again, the common element being that at any given time, the pages of this type which appear on the list were always created right around the same time. This error also has an extremely odd tendency to hit soccer/football players and plant or animal species far more often than any other type of article (although it's not restricted exclusively to those classes of topic). Examples: [[Antaeotricha ogmosaris]]. As in #2, these vary in whether a null edit will clear them, or whether I have to go to a full-on delete/restore.

(5) Random clusters of former articles which have been converted into redirects. Examples: [[Magic Lamp]], [[Magic lamp]], [[Magical lamp]]. Sometimes, but not always, a delete-restore will clear them; nothing else will.

(6) Soft redirects to Wiktionary, where an editor has tried to convert the redirect into a DICDEF article but then another editor has reverted it back to a soft redirect to Wiktionary again: for some reason, the uncats list loses the ability to bypass them as it normally does with soft redirects, but now considers them to be full articles. The only solution that has ever worked in this case was to add them to the "Temporary maintenance holdings" category. I haven't seen any examples of this yet in the current batch, although most of the current contents of "Temporary maintenance holdings" are prior examples of it.

(7) Random clusters of longstanding articles where I can't figure out any discernible reason at all for the error; the common element in this case is that when this happens, the articles involved are all in the same category as each other. Examples: [[Zorlovići]], [[Čardak, Pljevlja]], [[Čavanj]], [[Čerjenci]], [[Čestin, Montenegro]], [[Đuli]], [[Đurđevića Tara]] and [[Ljuće]], all of which are and have always been properly filed in the category "Populated places in Pljevlja Municipality" (which also contains many other pages that aren't being detected as uncategorized, so the category itself isn't the problem). It's worth noting that the last time I can recall seeing this, it also involved populated places in a Slavic-language country (Russia that time rather than Montenegro), though it has at times hit other categories as well.
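
For anyone trying to reproduce the cases above, here is a minimal sketch of how the live state of a listed title could be checked through the MediaWiki API, so that deleted pages, redirects and properly categorized pages can be told apart from genuinely uncategorized ones. The function names and labels are hypothetical, and this is not how dplbot itself works. It assumes a `formatversion=2` response from `action=query` with `prop=info|categories`; fetching the response is left out.

```python
def build_query_params(titles):
    """Build parameters for one action=query request.

    The API accepts at most 50 titles per request for ordinary users,
    so longer lists have to be split into batches by the caller.
    """
    return {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "prop": "info|categories",
        "clshow": "!hidden",   # ignore hidden maintenance categories
        "cllimit": "max",
        "titles": "|".join(titles),
    }


def classify_page(page):
    """Classify one entry of response["query"]["pages"].

    Caveat: if the response carries a "continue" block, the category
    list may be incomplete and the request has to be continued first.
    """
    if page.get("missing"):
        return "deleted"        # case (3): page no longer exists
    if page.get("redirect"):
        return "redirect"       # case (5): converted into a redirect
    if page.get("categories"):
        return "categorized"    # cases (1), (2), (4), (7)
    return "uncategorized"      # belongs on the list


def false_positives(pages):
    """Return the titles that should not be on the uncategorized list."""
    return [p["title"] for p in pages if classify_page(p) != "uncategorized"]
```

Soft redirects to Wiktionary, as in case (6), are not real redirects, so this sketch would only report them as "categorized" if they carry a visible category.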

It is really frustrating to constantly have to deal with all of these, because they end up sucking up unnecessary amounts of time and energy. I should be able to power through a tagging batch in 20-30 minutes at most, but these issues invariably turn it into a two-to-three-hour job because I have to stop and investigate and null-edit or delete-restore pages I shouldn't even be seeing on the list at all. And, in fact, I should be able to just let a bot loose on the list and not actually have to devote any of my own time and energy to tedious tasks like this at all -- but as long as errors like these are polluting the list, I can't.

I'd really appreciate it if somebody could actually figure out how to fix this finally. Thanks.

Event Timeline

valhallasw renamed this task from "polluted data on toollabs" to "dplbot untagged uncats: polluted data on toollabs". May 26 2016, 4:50 PM
valhallasw added subscribers: russblau, JaGa.

Thank you for filing a bug with such an extensive description! I'm not familiar with the dplbot tool, so I'm not sure what could cause these kinds of issues. When it comes to the labs database, there are two issues to keep in mind:

Replica drift could cause odd issues, but it's a) rare and b) random, so I don't think it's likely that your issues are caused by this. Replication lag (replag) could be an issue, and there could also be a bug in the dplbot code where reverts are somehow not processed correctly.
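
One way to test the drift hypothesis for a single page would be to compare the categories recorded on the replica with what the live wiki reports. The sketch below is only an illustration with hypothetical function names. It assumes the replica values come from the `cl_to` column of the `categorylinks` table (underscores instead of spaces, often returned as bytes) and the live values are category titles from the API (with a "Category:" prefix). Both sides need the same treatment of hidden categories for the comparison to be meaningful.

```python
def normalize_replica(cl_to):
    """Turn a cl_to value (bytes or str, with underscores) into a plain name."""
    if isinstance(cl_to, bytes):
        cl_to = cl_to.decode("utf-8")
    return cl_to.replace("_", " ")


def normalize_live(title):
    """Strip the 'Category:' namespace prefix from an API category title."""
    prefix = "Category:"
    return title[len(prefix):] if title.startswith(prefix) else title


def category_drift(replica_cl_to, live_titles):
    """Report categories that differ between the replica and the live wiki."""
    replica = {normalize_replica(c) for c in replica_cl_to}
    live = {normalize_live(t) for t in live_titles}
    return {
        "missing_on_replica": sorted(live - replica),
        "only_on_replica": sorted(replica - live),
    }
```

A non-empty "missing_on_replica" result at zero replag would point towards drift rather than lag; two empty lists would suggest looking at the tool's own logic instead.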

Just for the record, the current replag isn't the issue at all -- these are all ongoing problems that I've encountered even when replag was at zero. Replica drift may be the issue, or possibly there are corruption issues in en's data in the first place. But my past bug reports over there have never resolved the issues, because nobody ever seems to find the source of the problem -- so it has been suggested a couple of times recently that I try here instead.

Replag is down to zero at the moment, so I'm going to take the opportunity to add a bit more context. I haven't gotten all the way through the entire batch yet today, but here are all the articles from A through F that should not actually be appearing on the list, as they fall under one of the issues I noted above: Andreas Nödl, Axiom Landbase Pvt. Ltd., BatissForever, Bhaiyato The Hobbit, Bingham Road railway station, Byjus classes, Carrhotus malayanus, Changed people, Changed person, Christ (Lithograph), Château de Bollwiller, Château de Butenheim, Cyclone Victor(1986), Dee Sterling, Diatraea amazonica, Diatraea amnemonella, Diatraea angustella, Diatraea balboana, Diatraea cayennella, Diatraea colombiana, Diatraea entreriana, Diatraea flavipennella, Diatraea guapilella, Diatraea luteella, Diatraea maritima, Diatraea morobe, Diatraea obliqualis, Diatraea pittieri, Diatraea rosa, Diatraea savannarum, Diatraea silvicola, Diatraea umbrialis, Divegrass, Divisor theory, Eddie Mao, Egger Island, Egyptian revolution of 1919, Electrical Trades Union (United Kingdom), Escalo Frio, Faithlessly, Fimbriata.