
incategory search returns years-old false positives
Open, Needs Triage, Public

Description

I searched

incategory:"Files with no machine-readable license" insource:/eview/ -FlickreviewR

on Commons, in an effort to find files in that category that have a review template in source wikitext.

This query returns some old files. By sorting by edit date, I found for example this:

https://commons.wikimedia.org/wiki/File:Korg_Electribe_MX_(EMX-1)_Valve_Force.jpg

It was in that category for less than a minute when it was uploaded in 2010! This edit https://commons.wikimedia.org/w/index.php?diff=43801464 was made in the same minute as the upload, and as soon as it went through, the file was out of that category.

Yet it still shows up in my search query 10 years later!

The file has been edited more than 10 times over the decade, and was last edited in 2017, so your database should have been updated by now, right?!

I don't know whether this kind of false positive is solely related to the incategory command or not. Please investigate.

Event Timeline


That's certainly odd.

I pulled the indexed document (?action=cirrusdump on any page URL) from the search engine, and indeed it contained the incorrect categories. I then requested a new search doc build (cirrusbuilddoc API query prop), and it did not contain the categories. The current search engine only dates back to 2013, so it shouldn't be possible that we still have old content from 2010 in it. I issued a reindex on that page title and the stored doc no longer has the incorrect categories.
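For anyone who wants to repeat that check on other pages, here is a rough Python sketch of the comparison. It only builds the two URLs and diffs the category lists. The helper names are made up for illustration, and I'm assuming both documents expose their categories as a list under a `category` key, which should be verified against real output before relying on it.

```python
from urllib.parse import urlencode

# Commons endpoints; swap these for another wiki as needed.
INDEX_PHP = "https://commons.wikimedia.org/w/index.php"
API_PHP = "https://commons.wikimedia.org/w/api.php"


def cirrusdump_url(title):
    """URL that dumps the document currently stored in the search index."""
    return INDEX_PHP + "?" + urlencode({"title": title, "action": "cirrusdump"})


def builddoc_url(title):
    """URL that asks the API to build a fresh search document for the page."""
    params = {
        "action": "query",
        "prop": "cirrusbuilddoc",
        "titles": title,
        "format": "json",
    }
    return API_PHP + "?" + urlencode(params)


def stale_categories(indexed_doc, rebuilt_doc):
    """Categories present in the indexed doc but absent from a fresh build.

    Assumption: both arguments are dicts holding a list under "category".
    """
    indexed = set(indexed_doc.get("category", []))
    rebuilt = set(rebuilt_doc.get("category", []))
    return sorted(indexed - rebuilt)
```

Fetch both URLs with whatever HTTP client you like, pull the document out of each response, and pass the two dicts to `stale_categories`; a non-empty result means the index is holding categories the page no longer has.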

I'm truly not certain where those could have come from, and there is the further question of how many other pages are wrong.
It's also odd because there is a process that visits all pages and reindexes them every 2 months or so. Estimating from its current position and rate, that process would have gotten to this page in about 3-4 days, suggesting the last update was ~7 weeks ago.
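
The estimate is just the cycle length minus the time left until the next visit. A quick sketch of the arithmetic, assuming "2 months or so" means a cycle of about 8 weeks (the exact cycle length is my assumption, not a measured value):

```python
# Assumed full-cycle length of the reindexing process: 8 weeks.
CYCLE_DAYS = 8 * 7


def weeks_since_last_visit(days_until_next_visit, cycle_days=CYCLE_DAYS):
    """Weeks elapsed since the process last touched a page, given how many
    days remain until it reaches that page again."""
    return (cycle_days - days_until_next_visit) / 7


# The page was 3-4 days away from its next visit.
for days in (3, 4):
    print(days, "days out:", round(weeks_since_last_visit(days), 1), "weeks ago")
```

Under that assumption, both ends of the 3-4 day range land between 7 and 8 weeks, in line with the rough figure above.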